ABSTRACT 


As digitization of data becomes more prevalent, the demands on existing 
communications networks and computer systems to cope with this increase become 
overwhelming. Currently, the speech compression problem is handled using the CELP 
(Code Excited Linear Prediction) scheme and its derivatives. Such techniques are the 
most frequently used for speech compression at medium-to-low rate ranges. Recent 
research conducted into the area of cosine packets has proven this field to be readily 
adaptable to speech compression and coding. In this thesis, speech compression schemes 
are developed using cosine-packet decomposition, minimum entropy basis selection, and 
an adaptive thresholding scheme for selecting coefficients. In addition, voiced-unvoiced 
segmentation and a denoising scheme are implemented. Test results show high 
compression ratios (1:50) with a good quality of reconstructed speech. 

I. INTRODUCTION 


Speech compression allows smaller bandwidth, higher data rates or a combination 
of these attributes. It can also be used to store speech-like data in a compact form. 

This thesis develops speech compression schemes based on the Local 
Trigonometric Transform [2], which use an adaptive thresholding scheme proposed in 
this work. These schemes perform a time partition of the original speech data first, 
according to a maximum depth selected by the user. An experimentally derived, optimum 
depth is proposed, based on the results of tests with several words and phonemes (defined 
in Chapter II). Following the time partitioning, a basis obtained via the minimum entropy 
best basis algorithm is selected. In order to perform compression, coefficients are selected 
according to an adaptive thresholding scheme, which varies the compression percentage 
depending on the energy and frequency content of each interval. The intervals are 
classified by a voiced-unvoiced segmentation algorithm. Depending on their 
classification, selection of coefficients is made in such a way that more coefficients are 
preserved for the voiced than for the unvoiced intervals. Then, these coefficients are 
encoded using uniform quantizers and Huffman coding to achieve average compression 
ratios of 1:50. In addition, two denoising schemes are proposed to minimize effects of 
equipment noise below 120 Hz, thereby improving the sound quality. 

In a typical scenario, users of the proposed schemes will be able to adjust speech 
quality and transmission bandwidth, based on the current channel bandwidth available. 
They will be provided with the parameters that maximize the compression ratio, and 
minimize the required bandwidth at an acceptable speech quality. Using lower bit rate 
coding reduces the transmission bandwidth of the signal and may prove to be quite useful 
in partial band jamming environments where the available channel bandwidth may be 
limited. It is understood that the schemes proposed may be useful for military 
applications where the understanding of the message is more important than the overall 
quality of the sound. This thesis concentrates on finding the best possible compression 
ratio, while still keeping an acceptable sound quality. In this work, sound quality is 
defined in terms of a mean square error as well as in terms of a proposed quality measure. 
Extensions of the proposed techniques lead to data storage improvements and they can 
also easily be adapted to cryptographic applications. 

The thesis is organized in the 
following manner. Chapter II presents an introduction to speech processing, where the 
concepts of phonemes and coarticulation effects are introduced and illustrated. Chapter 
III introduces the Local Trigonometric Transform and presents the Local Cosine 
Transform adopted in our work. The Local Cosine Transform can be viewed as a basic 
building block for the more complex Cosine Packet Transform, which has been used 
recently in speech applications [2]. The Cosine Packet Transform can also be viewed as a 
dual operation of the Wavelet Packet Transform [2]. Both packet schemes are presented, 
discussed and compared in Chapter IV. The Wavelet and Cosine Packet Transforms 
involve the selection of a particular basis “best” matched to the signal under study for 
compression applications. This choice of basis is carried out via the Best Basis algorithm, 
which is presented in Chapter V. Chapter VI presents the denoising and compression 
schemes investigated in this work. Denoising allows for enhancement of the audio quality 
of the speech signals when noise is present. Chapter VII describes the encoding schemes 
used to compress the speech information. Chapter VIII first discusses the experiments 
and parameters used to test our denoising and compression schemes. Next, it presents the 
results obtained using various phonemes, words and sentences. The database consists of 
a limited collection of American-English words, some Portuguese words and some 
typical voiced and unvoiced sound segments. Some of the more elaborate data sets 
consist of complete sentences and dialogues. Finally, we compare compression results 
obtained with our Cosine Packet scheme and those obtained with the Wavelet Packet 
scheme using a “Daubechies” basis function [17]. Results show that the Cosine Packet 
Transform outperforms the Wavelet Packet Transform on the speech segments considered 
in this study. Finally, Chapter IX contains the conclusions and final considerations. All 
computer algorithms are listed in the Appendix. 

II. INTRODUCTION TO SPEECH PROCESSING 


One of the principal differentiating features of any speech sound is excitation [1]. 
Two elemental excitation types are present in speech data: (1) voiced and (2) unvoiced. 
Voiced sounds have high energy and low frequency, while unvoiced sounds have low 
energy and high frequency. Another important characteristic of speech signals is that they 
are locally stationary. 

The basic theoretical unit for describing how speech conveys linguistic meaning is 
called a phoneme. Each language has its own set of phonemes. For example, American 
English has about 42 phonemes, while Brazilian Portuguese has about 51 phonemes (Rio 
de Janeiro region). They are made of vowels, semivowels, diphthongs, and consonants. In 
general, the duration of each phoneme may vary from 15 to 400 milliseconds, depending 
on the sound produced and the way it is pronounced. For example, vowels can vary 
largely in duration, typically from 40 to 400 milliseconds. 

The transition from one phoneme to another is not made abruptly or 
independently of adjacent phonemes. Actually, adjacent phonemes have a strong 
influence on the manner in which the transition takes place. The term used to refer to the 
change in phoneme articulation and acoustics that is caused by the influence of another 
phoneme is coarticulation. 

Since this research investigates speech compression, there are two main 
requirements. First, we need to be able to split a speech signal into its smallest locally 
stationary “cells” constituted by phonemes, and represent them in a minimal way with 
good fidelity. Second, we need to preserve coarticulation effects as much as possible (i.e., 
we need to preserve the smooth transition from one phoneme to the next). 

Figure 2.1 illustrates the coarticulation process for the sound /issos/. The top plot 
represents time-domain speech. The middle plot represents the voiced and unvoiced 
portions of this sound obtained using the zero-crossing rate and the short-time energy 
contained in the sound [1]. The unvoiced portions are -ss- and -s-, corresponding to the 
phoneme /s/. The high frequency and low energy of unvoiced segments are illustrated by 
the low short-time energy and high zero-crossing rates. The voiced portions of the sound 
are the phonemes /i/ and /o/. The low frequency and high energy of voiced phonemes are 
illustrated by the high short-time energy and low zero-crossing rate. The bottom plot 
shows the spectrogram obtained using a Hanning time window of length 256 samples 
with an overlap of 128. Note the coarticulation effects present, which allow for smooth 
transitions between phonemes. For example, the transition from /i/ to /s/ occurs through a 
“link,” which takes place in a high frequency portion of the spectrum, and which is an 
example of anticipatory coarticulation (or right-to-left coarticulation). This means that the 
articulator has moved from the present phoneme (/i/) toward a position (higher frequency) 
that is more appropriate for the following phoneme (/s/). 

Figure 2.1 Sound “ISSOS,” male non-native speaker; top plot: Time domain 
representation; middle plot: Short time energy and zero-crossing representation; 
bottom plot: Spectrogram of “ISSOS” using a Hanning time window of length 256 
samples with overlap 128, fs = 8 kHz. 
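
The two features used for the middle plot of Figure 2.1 can be illustrated with a minimal sketch. The frame length, hop size, and the two synthetic test signals below are assumptions made for illustration (they are not the parameters or data of [1] or of this thesis); only the 8 kHz sampling rate is taken from the figure.

```python
import numpy as np

def short_time_features(x, frame_len=256, hop=128):
    """Return per-frame short-time energy and zero-crossing rate.

    frame_len and hop are illustrative choices, not values taken from [1].
    """
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.empty(n_frames)
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        energy[i] = np.mean(frame ** 2)
        # Fraction of adjacent sample pairs whose signs differ.
        negative = np.signbit(frame)
        zcr[i] = np.mean(negative[1:] != negative[:-1])
    return energy, zcr

fs = 8000                                  # sampling rate of Figure 2.1
t = np.arange(fs // 4) / fs                # 250 ms of samples
# Hypothetical stand-ins: a strong low-frequency tone ("voiced-like")
# and weak white noise ("unvoiced-like").
voiced_like = np.sin(2 * np.pi * 150 * t)
unvoiced_like = 0.1 * np.random.default_rng(0).standard_normal(t.size)

e_v, z_v = short_time_features(voiced_like)
e_u, z_u = short_time_features(unvoiced_like)
```

On these synthetic signals the voiced-like segment shows the higher energy and the lower zero-crossing rate, which is the pattern described for /i/ and /o/ versus /s/.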

III. THE LOCAL TRIGONOMETRIC TRANSFORM 


This chapter discusses the main concepts related to the Local Trigonometric 
Transform theory and its implementation. Much of the mathematical rigor is omitted, and 
emphasis is placed on the basic theory and its application to speech processing. This 
chapter is divided into six sections. The first provides an introduction, and the second 
presents some basic definitions about the rising/cutoff function. The third section defines 
the folding and unfolding operations that are used for the transform [2]. The fourth 
describes the Continuous Transform and its main mathematical properties. The fifth 
defines the Discrete Transform. Finally, the last section applies these concepts and 
describes how the transform may be performed by using orthonormal bases to allow for 
signal analysis and synthesis. 

A. INTRODUCTION 

In order to analyze small portions of the speech signal, it must be partitioned in 
time. The local transform defined in this chapter applies a “local cosine,” which is a basis 
function that allows the signal to be cut into time slices. As first defined by Malvar in 
1987 [3], the “local cosines” provided a regularly spaced partition in time. Later, Coifman 
and Meyer [4] and Meyer [5] tackled the problem of modifying regular constructions to 
obtain windows with variable lengths that could be defined arbitrarily. They began by 
partitioning time into adjacent intervals $[a_j, a_{j+1}]$, as illustrated in Figure 3.1. Figure 3.2 
shows in more detail how the windows may be combined while still preserving the 
smoothness and integrity of the signal. The windows used are essentially the intervals 
$[a_j, a_{j+1}]$. The disjoint intervals $[a_j - \epsilon_j, a_j + \epsilon_j]$ allow the windows to overlap. In summary, 
the local cosines (called “Malvar wavelets”) are constructed with a rising duration ($2\epsilon_j$), a 
stationary period ($\Delta t$), and a decay (which lasts $2\epsilon_{j+1}$). The ability to arbitrarily and 
independently choose the duration of the rising and decaying, as well as the stationary 
section, is exactly what makes the Malvar wavelets different from other well-known 
wavelets (e.g., Gabor or Daubechies) [5]. Of course, it is important to use this ability 
efficiently. This choice will be discussed in the following chapters, where we focus on the 
best basis for decomposition of the signal. 

Figure 3.1 Arbitrary Partition of Time into Adjacent Intervals 



Figure 3.2 Overlapping windows of arbitrary size 


B. THE RISING / CUTOFF FUNCTION 

The well-known Discrete Cosine Transform (DCT) has, as its basis function, a 
“block cosine” (i.e., a rectangular window that multiplies the cosine function). The 
functions obtained by the block cosine result in a discontinuity or an abrupt variation in 
the signal. As a result, we have discontinuities at the block boundaries of the reconstructed 
signal. The effects produced include the so-called “blocking effect” in image coding, and 
the “clicking sounds” in speech coding [6]. These problems can be avoided by defining a 
window based on a function that allows for a smooth transition from zero to the amplitude 
of the cosine (on the left edge), as well as from that amplitude to zero (on the right edge). 

The function $r$ is defined as $r = r(t)$ in the class $C^d(\mathbb{R})$, for some $0 \le d \le \infty$, 
satisfying the following conditions: 

$$|r(t)|^2 + |r(-t)|^2 = 1 \ \text{ for all } t \in \mathbb{R}; \qquad r(t) = \begin{cases} 0, & \text{if } t \le -1, \\ 1, & \text{if } t \ge 1. \end{cases} \tag{3.1}$$

It is called a rising cutoff function because $r(t)$ monotonically increases from zero to one 
as $t$ goes from $-\infty$ to $+\infty$. That function is presented in Figure 3.3. 



Figure 3.3 The rising cutoff function (horizontal axis: time) 

C. FOLDING AND UNFOLDING 

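The folding operator $U$ and its adjoint unfolding operator $U^*$ are defined as follows: 

$$U f(t) = \begin{cases} r(t) f(t) + r(-t) f(-t), & \text{if } t > 0, \\ \bar{r}(-t) f(t) - \bar{r}(t) f(-t), & \text{if } t < 0, \end{cases} \tag{3.2}$$

$$U^* f(t) = \begin{cases} \bar{r}(t) f(t) - r(-t) f(-t), & \text{if } t > 0, \\ r(-t) f(t) + r(t) f(-t), & \text{if } t < 0. \end{cases} \tag{3.3}$$

Observe that $Uf(t) = f(t)$ and $U^*f(t) = f(t)$ if $t \ge 1$ or if $t \le -1$. Also, $U^*U f(t) = 
UU^* f(t) = (|r(t)|^2 + |r(-t)|^2) f(t) = f(t)$ for all $t \ne 0$, so that $U$ and $U^*$ are 
isomorphisms of $L^2(\mathbb{R})$. This means that one operator is the inverse of the other. 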

Figure 3.4 illustrates the unfolding operation on a block cosine. Figure 3.4a shows 
a block cosine. Figures 3.4b and 3.4c illustrate the cosine unfolded at its left edge, and 
unfolded at both edges, respectively. 

Figure 3.5 presents a block cosine and a block sine after periodic folding. The 
purpose of folding is to prepare the function intervals, so that the adjacent windows can be 
overlapped further without changing the function in the overlapping interval. 
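
A discrete version of the folding and unfolding operators can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the rising cutoff is taken to be real-valued (so the conjugates drop out), its specific sine-based form is an assumed choice, samples are taken at midpoints around the boundary, and the function names `fold` and `unfold` are hypothetical.

```python
import numpy as np

def rising_cutoff(t):
    """Assumed real rising cutoff: 0 for t <= -1, 1 for t >= 1, sine-based between."""
    t = np.clip(np.asarray(t, dtype=float), -1.0, 1.0)
    return np.sin(0.25 * np.pi * (1.0 + np.sin(0.5 * np.pi * t)))

def _weights(eps):
    k = np.arange(eps)
    t = (k + 0.5) / eps                      # midpoint sampling, 0 < t < 1
    return k, rising_cutoff(t), rising_cutoff(-t)

def fold(x, a, eps):
    """Fold x about the boundary between samples a-1 and a (radius eps samples)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    k, rp, rm = _weights(eps)
    right, left = x[a + k], x[a - 1 - k]
    y[a + k] = rp * right + rm * left        # t > 0 branch of U
    y[a - 1 - k] = rp * left - rm * right    # t < 0 branch of U
    return y

def unfold(y, a, eps):
    """Adjoint of fold; undoes it because r(t)^2 + r(-t)^2 = 1."""
    y = np.asarray(y, dtype=float)
    x = y.copy()
    k, rp, rm = _weights(eps)
    right, left = y[a + k], y[a - 1 - k]
    x[a + k] = rp * right - rm * left        # t > 0 branch of U*
    x[a - 1 - k] = rp * left + rm * right    # t < 0 branch of U*
    return x

signal = np.random.default_rng(1).standard_normal(32)
folded = fold(signal, a=16, eps=8)
restored = unfold(folded, a=16, eps=8)
```

Folding changes only the samples within the action radius of the boundary, preserves the signal energy, and is undone exactly by unfolding.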


Figure 3.4 Unfolding operator in a block cosine: (a) block cosine; (b) left edge unfolded; (c) both edges unfolded 

Figure 3.5 (a) Block cosine and (b) Block sine, both after periodic folding 


To extend the concept of folding and unfolding in an interval, the operators now 
can be shifted and dilated so that their action takes place on an arbitrary interval 
$(a - \epsilon, a + \epsilon)$. Now, after partitioning the time by periodically folding the left and right edges of 
each interval, all the adjacent component windows can be unfolded and overlapped. The 
window formed by the rising cutoff function is called a bell. Figure 3.6 displays two small 
bells (called child bells) overlapped and one inverted large bell (called the parent bell) 
below, showing that it is possible to preserve both the smoothness between intervals and 
the signal integrity (with no loss of information), if each interval is unfolded and then 
overlapped. This explains how parent windows may be split into two child windows (in 
the decomposition phase), and how two child windows may be combined to form one 
parent window (in the reconstruction phase). This property is particularly important when 
the concept of the “cosine packets” is introduced. 

Figure 3.6 Two child bells overlapped and one inverted parent bell 


D. THE CONTINUOUS LOCAL TRIGONOMETRIC TRANSFORM 

1. Properties 

The time window used in the Local Trigonometric Transform can have both 
smoothness and a controlled length so that properties such as time and frequency 
resolution also may be controlled. This can be implemented simply by changing the 
equation of the window. By combining windows of arbitrary size (represented by local 
cosines, i.e., block cosines unfolded at both edges), it is possible to obtain a smooth 
orthogonal basis. Observe that each window is well localized in time, as well as in 
frequency. Its temporal support region is the width of that interval given by 
$[a_j - \epsilon_j, a_{j+1} + \epsilon_{j+1}]$ and, thus, it has position uncertainty at most equal to that width (Figure 3.2). Figure 
3.7 presents three different bells, which are called functions $r_{[1]}$, $r_{[3]}$, and $r_{[5]}$. Figure 3.8 
presents the positive half of the real part of the Fourier Transform of the functions given in 
Figure 3.7. 




Figure 3.7 Three different bells: (a) $r_{[1]}$; (b) $r_{[3]}$; (c) $r_{[5]}$ 

Note that the sidelobes increase as the roll-off of the time window increases. 

Figure 3.8 Fourier transforms of the bells of Figure 3.7 

2. The Local Transform 

The Continuous Local Trigonometric Transform is based on a set of orthonormal 
basis functions that allow for a variable-length time window while still maintaining a 
small time-frequency bandwidth product. The Transform can be either a local cosine or a 
local sine. Since the local cosine has been chosen, the definition of the so-called “block 
cosine” at half-integer frequency is given as follows: 

$$C_n(t) = \cos\left[\pi\left(n + \tfrac{1}{2}\right) t\right], \tag{3.4}$$

where $n$ is an integer, and $t$ is restricted to the interval $[0,1]$. 

As can be observed from the right side of Figure 3.4c, unfolding the block cosine 
at the edges gives it the necessary smooth characteristics that contribute to a good 
frequency resolution for that transform. Basically, smoothness is obtained by a smooth 
cutoff by sine iteration [2], defined by: 



(3.5) 


(3.6) 


Since $r_{[1]}$ is smooth on $(-1,1)$ with a vanishing first derivative at the boundary 
points, the envelope (referred to as the bell in [5]) has a continuous derivative on $\mathbb{R}$. Based 
on the recursion in Equation (3.6), $r_{[i]}$ can be used with $i > 1$ to obtain additional 
derivatives [5]. Actually, it can be shown that $r_{[i]}(t)$ has $2^i - 1$ vanishing derivatives. $r_{[1]}$ is 
used, since it allows good resolution and has very small side lobes. 
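
The sine iteration can be sketched numerically. The form used below is an assumption based on the construction commonly given for this cutoff in the literature around [2], namely a base function $\sin[\frac{\pi}{4}(1+t)]$ on $[-1,1]$ whose argument is passed through $\sin(\pi t/2)$ once per iteration; it is not copied from Equations (3.5) and (3.6), and the function name and parameter `i` are hypothetical.

```python
import numpy as np

def rising_cutoff(t, i=1):
    """Assumed iterated-sine rising cutoff r_[i](t).

    i = 0 gives the base function sin(pi/4 * (1 + t)) on [-1, 1];
    each further iteration replaces t by sin(pi * t / 2).
    Outside [-1, 1] the value is 0 (left) or 1 (right).
    """
    t = np.clip(np.asarray(t, dtype=float), -1.0, 1.0)
    for _ in range(i):
        t = np.sin(0.5 * np.pi * t)
    return np.sin(0.25 * np.pi * (1.0 + t))

t = np.linspace(-1.5, 1.5, 301)
# Evaluate the three bells named in the text and check condition (3.1).
power_sums = {i: rising_cutoff(t, i) ** 2 + rising_cutoff(-t, i) ** 2
              for i in (1, 3, 5)}
```

Under this assumed form, every iterate satisfies $|r(t)|^2 + |r(-t)|^2 = 1$, because the iterated argument stays an odd function of $t$; higher iterations only make the rise steeper in the middle and flatter at the edges.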


Thus, the local cosine is defined as: 

where $a_j$ and $a_{j+1}$ are the interval edges, $\epsilon_j$ and $\epsilon_{j+1}$ are the action radii of the operators for 
both edges, and $r_j$, $r_{j+1}$ are the rising function $r_{[1]}$ applied at both edges of the interval. Note 
that the local cosine as defined by Equation (3.7) is the result of the unfolding operation at 
both edges of the “block cosine,” i.e., 

$$\psi_{n,j}(t) = U^*(r_j, a_j, \epsilon_j)\, U^*(r_{j+1}, a_{j+1}, \epsilon_{j+1})\, \mathbf{1}_{I_j}(t)\, C_{n,j}(t),$$

where: 

• $C_{n,j}(t)$ represents the block cosine function for an interval beginning at edge $j$; 

• $U^*(\cdot)$ is the unfolding operator applied at the left ($j$) and right edge ($j+1$) of the interval. 

Thus, the Continuous Local Trigonometric Transform is the inner product $\langle f, \psi_{n,j} \rangle$, 
where $\psi_{n,j}$ is the local cosine defined above. 

Instead of computing in that manner, one may fold the function first, and then 
obtain the inner product with the regular “block cosine,” as in the expression below: 

$$\langle f, \psi_{n,j} \rangle = \langle U_j U_{j+1} f,\ \mathbf{1}_{I_j} C_{n,j} \rangle. \tag{3.8}$$

In practice, this simple observation has great importance, since it means that $f$ can 
be preprocessed by folding, and the local cosine transform can be computed with an 
ordinary cosine transform [2]. 

It is also important to observe that, by defining the transformation as an inner 
product, what is measured is the amount of “similarity” between the signal $f(t)$ and the 
basis function $C_{n,j}$. This is one of the key attributes that make the local cosine transform 
convenient for the transformation of speech signals and, therefore, good for compression 
and coding. The fact that speech can be considered a locally stationary signal with a 
reasonable correlation to sines and cosines may explain some of the good results when 
using a Local Trigonometric Transform. 


E. THE DISCRETE COSINE TRANSFORM 

By replacing just the variables with integers, and by using the discrete cosine 
transform, it is possible to obtain discrete versions of the local cosine. So Equation (3.9) is 
exactly the same formula as Equation (3.7), but with the variables replaced by integer 
values. In Equation (3.9) it is assumed that: 

• $a_j < a_{j+1}$, where $a_j$ and $a_{j+1}$ are integers; 

• the signal is sampled at integer points $t$, $a_j \le t < a_{j+1}$, which gives $(a_{j+1} - a_j)$ 
samples; 

• $r_j$ and $r_{j+1}$ are the rising function $r_{[1]}$, applied at both edges of the interval; 

• $\epsilon_j > 0$ and $\epsilon_{j+1} > 0$, with $\epsilon_j + \epsilon_{j+1} <$ number of samples, to ensure that the action 
regions are disjoint. 

Equation (3.9) also makes a distinction between the left and right endpoints, 
because sampling is done at the left endpoint of each interval. If sampling is done in the 
middle of the intervals (which can be done by taking the function in Equation (3.7) and 
replacing every instance of t with t+1/2), it will be more symmetric, and the basis 
functions will be cosines sampled between grid points. The result is the following discrete 
local cosine basis function: 




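$$\psi_{n,j}(t) = \sqrt{\frac{2}{a_{j+1} - a_j}}\; r_j\!\left(\frac{t + \frac{1}{2} - a_j}{\epsilon_j}\right) r_{j+1}\!\left(\frac{a_{j+1} - t - \frac{1}{2}}{\epsilon_{j+1}}\right) \cos\left[\frac{\pi \left(n + \frac{1}{2}\right)\left(t + \frac{1}{2} - a_j\right)}{a_{j+1} - a_j}\right] \quad \text{(DCT-IV)}$$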

F. APPLICATION TO SIGNAL ANALYSIS/SYNTHESIS 


Given an arbitrary partitioning of a signal in time, it is possible to construct several 
smooth orthogonal bases, using the local cosine transform as the basis function. The 
scheme that leads to the best partition and the best basis for this application will be 
introduced in the next chapter. This section explains how the DCT-IV can be used for an 
analysis in the frequency domain and for further synthesis in the time domain. 

As mentioned in sections “C” and “D”, the signal is first folded at the left and 
right ends of each interval. Then, an ordinary DCT-IV transform is used to compute the 
Local Cosine Transform for each of the windows obtained. Now, it becomes possible to 
analyze each time window using the frequency spectrum (from DC to $f_s/2$, where $f_s$ is the 
sampling frequency). To reconstruct the signal, the DCT-IV is applied to obtain the 
inverse. As in the decomposition phase, the transform is computed first with the regular 
“block cosine,” and then the intervals are unfolded, instead of using the local cosine. By 
periodically unfolding the left edges of the current interval and the right edge of the 
following one, the smoothness and integrity of the function are preserved, allowing the 
time domain function to be reconstructed. 
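
The reason the same DCT-IV can serve for both analysis and synthesis is that the orthonormal DCT-IV matrix is symmetric and therefore its own inverse. The sketch below builds that matrix directly with NumPy; the function name and the eight-sample test window are hypothetical, and the folding and unfolding steps are deliberately left out.

```python
import numpy as np

def dct_iv_matrix(n):
    """Orthonormal DCT-IV matrix: entry (k, m) is sqrt(2/n) * cos(pi/n * (k+1/2) * (m+1/2))."""
    half_steps = np.arange(n) + 0.5
    return np.sqrt(2.0 / n) * np.cos(np.pi / n * np.outer(half_steps, half_steps))

C = dct_iv_matrix(8)
window = np.arange(8, dtype=float)   # stands in for one folded time window
coeffs = C @ window                  # analysis: time -> frequency
restored = C @ coeffs                # synthesis: the same transform again
```

Applying the matrix twice returns the original window, so no separate inverse routine is needed. Coefficient $k$ corresponds to the frequency $(k + \frac{1}{2}) f_s / (2N)$ for a window of $N$ samples.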

IV. WAVELET AND COSINE PACKET TRANSFORMS 


This chapter presents the Wavelet Transform and two general time-frequency 
analysis schemes: the Wavelet Packet Transform and the Cosine Packet Transform. 

A. INTRODUCTION 

The goal of this thesis is to obtain the scheme best suited for the decomposition 
and reconstruction of speech signals, in particular, one that can decompose a speech 
signal onto an orthonormal basis. First, the Wavelet Transform (WT) and its 
main properties and characteristics are discussed. Next, the general concept of the 
Wavelet Packet Transform (WPT) is introduced. Finally, the Cosine Packet Transform 
(CPT) is presented. This last scheme initially performs a time split, as opposed to 
transforms that first split the signal in the frequency domain. 

B. THE WAVELET TRANSFORM 

In the Wavelet Transform (WT) algorithm, the sampled data set is passed through 
the low-pass and high-pass filters with complementary bandwidths, known as quadrature 
mirror filter (QMF) pairs [7]. The outputs of both filters are decimated by a factor of two. 
So, at each scale, we have a set of high-pass filtered data and a set of low-pass filtered 
data. Each of these sets has half as many elements as the original data set, as a 
consequence of the decimation. The low-pass filtered data can be used as the data input 
for another pair of filters identical to the first pair, generating another set of low- and 
high-pass coefficients at the next lower level of scale [8]. 

This process can continue until the set of original coefficients has been reduced to 
the minimal scale level, which is two coefficients. Figure 4.1 presents the pyramid 
algorithm of the WT. Figure 4.2 shows how a unit interval of length $2^j$ samples can be 
decomposed to obtain a maximum of $j$ levels of transform data. Figure 4.3 presents the 
tiling diagram that corresponds to the WT decomposition. This shows that the WT works 
well if the signal is composed of strong components of short duration, i.e., bursts. This 
means that the WT is a good detector of transients. It also works well if the signal is 
composed of low-frequency components of long duration [9]. 

As stated earlier, speech is composed of portions of either high frequency or low 
frequency, both with a typical minimum duration of about 15 milliseconds. These 
characteristics indicate that the WT may not be the best scheme for speech signal 
analysis. 
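
The pyramid algorithm described above can be sketched with the Haar pair, which is the shortest quadrature mirror filter pair. Haar is an assumption chosen for brevity here; it is not the filter used in the thesis experiments, and the function name is hypothetical.

```python
import numpy as np

def haar_pyramid(x):
    """Return [d_1, d_2, ..., d_J, a_J] for a signal of length 2**J.

    At each scale the low-pass and high-pass outputs are decimated by two,
    and only the low-pass output is passed on to the next scale.
    """
    approx = np.asarray(x, dtype=float)
    n = approx.size
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two (>= 2)")
    bands = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        bands.append((even - odd) / np.sqrt(2.0))   # high-pass branch
        approx = (even + odd) / np.sqrt(2.0)        # low-pass branch
    bands.append(approx)
    return bands

x = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 0.0, 3.0, 7.0])   # 2**3 samples
bands = haar_pyramid(x)
sizes = [b.size for b in bands]
```

For $2^3$ samples this yields three detail bands of sizes 4, 2 and 1 plus one final approximation coefficient, so the total number of coefficients equals the number of input samples and the energy is preserved.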


Figure 4.1 WT implementation: A bank of QMF pairs 

Figure 4.2 Wavelet transform: decomposing $2^j$ samples into a maximum of $j$ levels 

Figure 4.3 WT tiling diagram 


C. THE WAVELET PACKET TRANSFORM 

The WT is not the only way to split the signal in the frequency domain. The Short 
Time Fourier Transform (STFT), for example, is another possible scheme. However, in 
the STFT, both the time and frequency resolution are kept constant by the choice of the 
time window length (Figure 4.4). 

Actually, both the WT and the STFT can be viewed as part of a general scheme 
called the Wavelet Packet Transform (WPT), which is a collection of possible sets of 
orthonormal basis functions. 

Figure 4.5 depicts the general tree structure for the WPT. Note that the heavy lines 
indicate the graph that forms the WPT basis. The symbol L or H has been assigned to 
each half frequency division, depending on whether it is a high- or low-frequency band. 
Following the tree structure, we have assigned those symbols sequentially, following the 
same rule. Note that the WT basis consists of the subspaces H, LH, LLH, LLLH and 
LLLL. 

Figure 4.5 General tree structure for the WPT 

The sequences L, LL and LLL are intermediate steps leading to the generation of 
the subspaces of the wavelet basis at the lower levels. 

Since the frequency splitting results in the low- or high-pass version of the filtered 
data (i.e., either half branches of the tree), j2^ graphs representing different orthonormal 
bases can be created. Figure 4.6 presents three different Wavelet Packet decompositions. 
The first is a subband decomposition scheme [10], where the basis obtained is composed 
of the eight bottom divisions. The second is another possible decomposition leading to an 
orthonormal basis. The third decomposition is exactly the opposite of that obtained using 
the WT. Figure 4.7 illustrates the tiling diagram that corresponds to the third 
decomposition. Note the higher frequency resolution for higher frequencies, and the 
higher time resolution for lower frequencies. 

Figure 4.6 Three different wavelet packet decompositions leading to three different bases 

Figure 4.7 Tiling diagram for the decomposition of Figure 4.6c 

D. THE COSINE PACKET TRANSFORM 


The Cosine Packet Transform (CPT) is a scheme that allows for a time-splitting 
decomposition prior to the frequency transformation. If one imagines the original signal 
in the time domain being split successively into two halves at each iteration, a tree 
configuration will result (Figure 4.8). If the transform imposes no restriction on the 
support intervals of the window envelopes, the tree does not need to be homogeneous. 
This means that the windows do not need to be combined in the same way (either in pairs 
or in any other specific manner). Also, the subspaces do not need to be of equal size. So, 
in analogy to the wavelet packets case, one is now faced with a large number of possible 
orthonormal basis configurations, each one of them being considered as a cosine packet. 
It is important to observe that in the cosine packets case, the windows do not need to be 
of a dyadic size, they may be of an arbitrary size. However, in this thesis, only dyadic 
sized windows are considered. 


Figure 4.8 Cosine packet transform: the tree configuration (axis labels: levels, time) 


We also note that, as one goes down the tree, time resolution is improved by a 
factor of two at each layer, while frequency resolution is decreased by a factor of two at 
each iteration. Figure 4.9 presents the tiling diagram that corresponds to the tree 
configuration shown in Figure 4.8. The CPT works in such a way that, after time splitting 
to a certain depth, a basis is selected by some criterion. Then, for each time window, the 
DCT-IV transform is applied. 
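
The trade-off between time and frequency resolution down the dyadic tree can be tabulated with a short sketch. The signal length of 4096 samples and the maximum depth of 5 are hypothetical values chosen for illustration; the 8 kHz sampling rate is the one quoted for Figure 2.1, and the frequency spacing assumes a DCT-IV of $N$ samples covering DC to $f_s/2$ in $N$ bins.

```python
def cosine_packet_levels(n_samples, fs, max_depth):
    """List window count, window duration and DCT-IV bin spacing per tree level."""
    rows = []
    for level in range(max_depth + 1):
        n_windows = 2 ** level
        win_len = n_samples // n_windows          # dyadic time split
        rows.append({
            "level": level,
            "windows": n_windows,
            "window_ms": 1000.0 * win_len / fs,   # time resolution
            "bin_hz": fs / (2.0 * win_len),       # frequency resolution
        })
    return rows

rows = cosine_packet_levels(n_samples=4096, fs=8000, max_depth=5)
for row in rows:
    print(row)
```

Each level halves the window duration and doubles the bin spacing. With these assumed numbers the deepest windows last 16 ms, close to the shortest phoneme duration quoted in Chapter II.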


V. THE BEST BASIS ALGORITHM 


A. INTRODUCTION 

When a choice of bases exists for the representation of a signal, it is possible to 
determine the best one using some predetermined criterion. The criterion will always 
depend on the type of signal and the user’s objective. In this case, the signal is speech 
and the objective is to minimize the number of symbols used to represent the information 
contained in a given interval (i.e., it is desirable to minimize the entropy of that interval). 
The “best basis” criterion allows for the minimization of several information cost functions, 
including the entropy minimization method [6,11]. 

We recall that the entropy of a vector $u = \{u(k)\}$ is defined by: 

$$H(u) = \sum_k p(k) \log\frac{1}{p(k)}, \tag{5.1}$$

where $p(k) = |u(k)|^2 / \|u\|^2$ is the normalized energy of the $k$th element of the sequence, and 
$p \log(1/p)$ is set to 0 if $p = 0$. $H(u)$ is the entropy of the probability distribution function 
(or pdf) given by $p$. Note that $H(u)$ is not an information cost functional, i.e., it is not a 
direct function of the sequence $\{u(k)\}$. But the functional 

$$l(u) = \sum_k |u(k)|^2 \log\frac{1}{|u(k)|^2}$$

is a direct function. If $l(u)$ is minimized, then $H(u)$ is also minimized in the expression: 

$$H(u) = \|u\|^{-2}\, l(u) + \log \|u\|^2. \tag{5.2}$$
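
The relation in Equation (5.2) can be checked numerically. The sketch below uses natural logarithms and a hypothetical four-element test vector; the function names are illustrative only.

```python
import numpy as np

def entropy(u):
    """H(u) of Equation (5.1), with p(k) = |u(k)|^2 / ||u||^2 and 0 log 0 taken as 0."""
    energy = np.abs(np.asarray(u, dtype=float)) ** 2
    p = energy[energy > 0] / energy.sum()
    return float(-(p * np.log(p)).sum())

def additive_cost(u):
    """l(u) = sum |u(k)|^2 log(1 / |u(k)|^2), skipping zero elements."""
    energy = np.abs(np.asarray(u, dtype=float)) ** 2
    energy = energy[energy > 0]
    return float(-(energy * np.log(energy)).sum())

u = np.array([3.0, 0.0, 4.0, 0.0])       # hypothetical coefficients, ||u||^2 = 25
norm_sq = float(np.sum(u ** 2))
lhs = entropy(u)
rhs = additive_cost(u) / norm_sq + np.log(norm_sq)
```

Because $\|u\|$ is the same in every orthonormal basis, the two sides differ only by a fixed scale and offset, so the basis that minimizes $l(u)$ also minimizes $H(u)$.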

B. THE BEST BASIS ALGORITHM METHOD 

Initially, the algorithm computes the entropy obtained in all intervals or “nodes” 
of the tree. Figure 5.1 presents an example of the cosine packet tree with corresponding 
computed entropies. The Best Basis Algorithm searches the tree in a bottom-up direction 
and, whenever a parent node has a lower cost than the combined cost of its children, the Best Basis 
algorithm flags the parent. If the sum of the children’s costs is lower than that of the 
parent node, this lower cost is assigned to the parent. Similarly, children are flagged when 
they have a lower information cost than their parents. This step avoids the need to 
examine any node more than twice: once as a child and once as a parent. Figure 5.2 
presents the new and the former (in parentheses) information costs for each node shown in 
Figure 5.1. Then, after all nodes present in the tree have been examined, the Best Basis 
Algorithm selects the topmost flagged nodes, which constitute a basis. Finally, as the 
topmost flagged node is encountered, the remaining nodes in the corresponding subtree 
are discarded. Figure 5.3 displays the best basis nodes for this example as shaded blocks. 
Further details may be found in [4]. Figure 5.4 shows a Best Basis tiling scheme resulting 
from the decomposition shown in Figure 5.3. It is obvious that each resulting cell 
occupies one portion of the time, and the whole frequency spectrum is covered by each of 
those cells. 
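The bottom-up search described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the thesis code: the tree is stored as a hypothetical list of levels, `costs[d][b]`, where node `b` at depth `d` has children `2b` and `2b + 1` at depth `d + 1`, and a parent is kept unless its children are strictly cheaper.

```python
def best_basis(costs):
    """Select the minimum-cost basis from a binary tree of node costs.

    costs[d][b] is the information cost of node b at depth d (0 = root).
    Returns (basis, total_cost); basis lists (d, b) nodes left to right in time.
    """
    max_depth = len(costs) - 1
    best = [list(level) for level in costs]          # costs after the update
    keep = [[True] * len(level) for level in costs]  # flag: node beats its children

    # Bottom-up pass: compare each parent with the sum of its two children.
    for d in range(max_depth - 1, -1, -1):
        for b in range(len(costs[d])):
            children = best[d + 1][2 * b] + best[d + 1][2 * b + 1]
            if children < best[d][b]:
                best[d][b] = children   # parent inherits the lower cost
                keep[d][b] = False

    # Top-down pass: take the topmost flagged node on every branch
    # and ignore the rest of its subtree.
    basis = []
    stack = [(0, 0)]
    while stack:
        d, b = stack.pop()
        if keep[d][b]:
            basis.append((d, b))
        else:
            stack.append((d + 1, 2 * b + 1))
            stack.append((d + 1, 2 * b))
    return basis, best[0][0]
```

Each node is visited once as a parent and once as a child, in line with the description above.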



Figure 5.1 Cosine packet tree with computed entropies for every interval (node)

Figure 5.2 New (and former) computed entropy for each node

Figure 5.3 Selection of minimum entropy basis

Figure 5.4 Best basis tiling scheme resulting from the decomposition in Figure 5.3

VI. COMPRESSION AND DENOISING SCHEMES


This chapter describes the compression and denoising schemes used in this
research. It is divided into five sections. First, the motivating concepts are introduced.
The remaining sections present the minimum time window, voiced-unvoiced segmentation,
adaptive thresholding, and denoising.

A. INTRODUCTION 

Initial research for this thesis included reviewing existing lossy compression 
techniques, which are divided into two main classes: Lossy Predictive Coding and 
Transform Coding [12]. The attention of this thesis is directed to Transform Coding. The 
Transform Coding technique that has been largely discussed, applied, and tested is the 
Wavelet Transform. However, as explained in Chapter IV, wavelets are more appropriate
for the analysis of either transients or long-duration, low-frequency stationary signals
than for speech signals.

As shown in Chapter III, the Local Cosine Transform has good time and 
frequency resolution. Also, unlike the Fourier Transform, the Discrete Cosine Transform 
IV (DCT-IV) decorrelates the signal in each window, which facilitates compression. 
Experiments for this research demonstrated that the Best Basis Algorithm, besides 
selecting the basis with minimal entropy, is also able to split the speech signal into locally 
stationary time segments. As a result, the combination of the Cosine Packet scheme with 
a method that selects the Best Basis (BB) configuration to minimize the entropy in each 
interval seems to be most appropriate for the applications considered here. 

An important characteristic of the Cosine Packet Transform (CPT) is that it allows
time resolution to be controlled. If one uses the Wavelet Packet Transform (WPT) with the
Best Basis Algorithm on speech, the algorithm chooses the basis by minimizing some
information cost of the frequency coefficients. Thus, in the WPT case, time resolution is
not a function of the physical properties of speech. Instead, it depends on each scale,
which in turn is selected by the best basis criterion. Also, the user must select the maximum
frequency splitting depth by choosing the worst (largest) time resolution, not the best.
With the CPT, on the other hand, it is possible to choose the depth and, thus, to determine
the minimum time interval, which ideally should coincide with the minimum locally
stationary portions of speech.

Once the signal is divided into its locally stationary intervals, the DCT-IV
algorithm is applied to transform the signal to the frequency domain. Then, for all of the
time windows, the signal is passed through a thresholding scheme that selects different
percentages of coefficients, according to the frequency and energy content of each
frame. Speech is divided into voiced and unvoiced sounds, which makes it
necessary to implement a scheme for voiced-unvoiced segmentation.

Recordings made for this research included noise generated by the equipment.
This noise consisted mainly of a 60 Hz hum and its harmonic components. Since the
noise frequencies in each time window were detectable, it was possible to denoise the
words and sentences used in the experiments. The system is composed of three main
blocks, as shown in Figure 6.1.



Figure 6.1 System block diagram 







The Cosine Packet scheme, presented in Chapters IV and V, is based on the CPT and 
Best Basis Algorithm. The encoding/compression schemes investigated in this work will 
be presented in Chapter VII. 

B. MINIMUM TIME WINDOW SIZE 

The choice of the minimum time window depends on the time and frequency
resolution desired. We recall that, in the CPT scheme, the further down the tree, the
better the time resolution and the worse the frequency resolution. A second consideration
is to represent a clean signal in an optimal way, so that the smallest possible number of
DCT-IV coefficients (in the frequency domain) represents the energy and
frequency content of each interval. Ideally, the signal should be divided into the exact
locally stationary portions of the speech, each beginning and ending at the correct points.
This yields good compression ratios, with each time interval requiring only one or
two representative coefficients.

The best minimum window sizes were 32 or 16 milliseconds for most of the
experiments, and 8 milliseconds for some of them. Since samples were taken at 8 kHz,
this means that the intervals are 256, 128, or 64 samples, respectively. Using windows
shorter than 16 ms degraded the frequency resolution for most of the test words and
sentences, which led to the following two results:

(1) Loss of coarticulation;

(2) Degradation in denoising performance.

Although the depth corresponding to the 16-ms minimum-size window was not 
always the one that gave the best (least) mean square error (i.e., comparing to 32-ms and 
8-ms test windows), the difference obtained in that parameter was not large enough to 
justify choosing another depth. This was mainly due to the quality factor in 
reconstruction. Consequently, 16 milliseconds was selected as a compromise for the 
minimum window size.

C. VOICED-UNVOICED SEGMENTATION

This section presents an experiment based on the voiced-unvoiced segmentation
scheme proposed by Wesfreid and Wickerhauser [13]. Recognition of certain excitation
types was attempted to obtain the best possible scheme for compression. Therefore,
speech partitioning became one of the by-products of this research. Once each interval’s
magnitude spectrum and energy are obtained, it is possible to identify voiced and unvoiced
portions of the speech.

The spectrum is divided into six main frequency ranges. Table 6.1 displays the
low and high frequencies in each range, as well as the corresponding amplitudes of the
vertical bars used to separate the intervals.

Frequency Range (Hz)        Vertical Bars
Low          High           Amplitude
0            250            0.1
251          500            0.25
501          1,000          0.5
1,001        2,000          1.0
2,001        3,000          2.0
3,001        4,000          2.5

Table 6.1 Frequency ranges and display

Figure 6.2 illustrates the short-time energy and zero-crossing plots (top) from
Voicedit, from the SPC Toolbox [16], for the sentence “This place blows” (bottom).
Figure 6.3 presents four plots. The first shows the time domain plot. The second and third
plots show, respectively, the frequency behavior according to Table 6.1, and the energy
behavior obtained by summing the squares of the coefficients in each interval. The fourth
plot (bottom) shows the spectrogram of the speech signal. Note that the trends of both the
frequency and energy plots match those of Figure 6.2.

Voiced-unvoiced segmentation obtained the best results when all the intervals
with the largest coefficient positioned at a frequency below 1,000 Hz and energy above a
certain threshold were assigned as voiced. All the intervals with the largest coefficient at
a frequency above 1,000 Hz were assigned as unvoiced. Figure 6.3 illustrates that a
voiced sound results in a high energy and low frequency (largest coefficient frequency
below 1,000 Hz) representation for those segments. This is the case for the sounds “/i/,”
“/a/,” and “/o/.” In turn, unvoiced sounds are recognized as segments with high
frequency (largest coefficient frequency above 1,000 Hz) and low energy content. This is
the case of the sound “/s/” from “this” and “place.” Figure 6.4 shows the result of the
voiced-unvoiced segmentation scheme, which can be observed in the middle plot. The
bottom plot contains the corresponding spectrogram. Figures 6.5, 6.6, and 6.7 present the
same kind of plots for the sentence “Be nice to your sister.” Again, the voiced sounds
“/a//,” “/o/,” and “///” are distinguishable from the unvoiced “/s/” and “///.”
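The decision rule stated above can be sketched in a few lines. The following Python sketch is illustrative only: the function names, the `energy_threshold` parameter (the thesis does not give its value here), and the "unclassified" label for low-frequency, low-energy intervals are assumptions. The bin-to-frequency mapping uses the standard DCT-IV centre frequency, (k + 1/2) fs / (2N).

```python
def dct4_bin_frequency(k, n, fs=8000.0):
    """Centre frequency in Hz of DCT-IV coefficient k in an n-sample window."""
    return (k + 0.5) * fs / (2.0 * n)

def classify_interval(coeffs, energy_threshold, fs=8000.0, split_hz=1000.0):
    """Label one interval from its DCT-IV coefficients.

    Largest coefficient above split_hz -> "unvoiced".
    Largest coefficient below split_hz and energy above the threshold -> "voiced".
    Anything else is left "unclassified" (an assumption of this sketch).
    """
    n = len(coeffs)
    energy = sum(c * c for c in coeffs)               # sum of squared coefficients
    k_max = max(range(n), key=lambda k: abs(coeffs[k]))
    f_max = dct4_bin_frequency(k_max, n, fs)
    if f_max > split_hz:
        return "unvoiced"
    if energy > energy_threshold:
        return "voiced"
    return "unclassified"
```

With 128-sample windows at 8 kHz, adjacent coefficients are 31.25 Hz apart, so the 1,000 Hz boundary falls between coefficients 31 and 32.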

D. ADAPTIVE THRESHOLDING 

This section utilizes the partitioning of speech into voiced-unvoiced segments to 
implement an adaptive scheme for selecting cosine packet coefficients. 

Experiments showed that more natural-sounding speech was reconstructed after
compression when more coefficients were used to represent voiced than unvoiced segments.
This resulted in the use of a different percentage of coefficients in the following four
cases:

A) Low frequencies, low energy

B) Low frequencies, high energy

C) High frequencies, low energy

D) High frequencies, high energy.

The need to select more coefficients to represent the voiced segments of speech is
illustrated in the example where the isolated noise-free word “nice” is compressed. As
explained in Section B, the minimum window size chosen is 16 ms. Figure 6.8 shows
that, when the compression scheme is set to keep one cosine packet coefficient per 16 ms
window to represent the phoneme /i/, the higher formants of that phoneme are lost. As a
result, the phoneme /i/ tends to sound like a /u/. This example illustrates the fact that
more than one coefficient may be required to represent voiced phonemes accurately.
Figure 6.9 presents the plots that result when two CP coefficients per 16 ms window are
selected to represent voiced phonemes (including phoneme /i/), and one CP coefficient
out of every 16 or 32 ms interval is selected to represent unvoiced phonemes. Although a
lower mean value is achieved for the percentage of selection (and, thus, a higher
compression rate), the sound /i/ is correctly reconstructed without affecting the other
phonemes of the word “nice.”

Similar findings were obtained with other voiced phonemes such as /a/ and /o/. In
addition, experiments showed that the voiced plosive /p/ was degraded by the
compression process and sounded like a /b/. Keeping three cosine packet coefficients per
16 ms window interval for voiced segments led to a more accurate representation of the
information after compression, as confirmed by the smaller MSE and better sound quality
in the reconstructed signal. Further experiments showed that one cosine packet
coefficient per 16 ms interval is sufficient to represent the unvoiced segments accurately.
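The coefficient selection discussed in this section amounts to keeping the largest-magnitude coefficients of each window and zeroing the rest. A minimal Python sketch follows; the function names are hypothetical, and the counts (three for voiced, one for unvoiced 16 ms windows) are the ones reported in the paragraph above rather than values read from the MATLAB code.

```python
def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients of one window."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:k])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

def threshold_window(coeffs, voiced):
    """Adaptive selection for a 16 ms window: 3 coefficients if voiced, else 1."""
    return keep_largest(coeffs, 3 if voiced else 1)
```

Because the positions of the surviving coefficients are not known in advance, they have to be transmitted as side information along with the coefficient values.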


E. DENOISING 

Previous sections have considered only the problem of compressing noise-free
signals. However, some of our recordings had a significant amount of low-frequency
equipment noise located around 60 Hz and some of its harmonics. As a result, a denoising
step, applied to the noisy speech signal prior to compression, was investigated to improve
the quality of the compressed signal. The denoising code is given in the Appendix.

Two different cases where noise was present were considered: noise-only data
segments and noisy speech segments. Noise-only data segments occur before and after
isolated word recordings, and between words in the sentence recordings. Experiments
showed that the cosine packet coefficients allowed the detection of noise-only segments.
The following two situations characterize the noise-only case according to
implementation ndencomp.m, given in the Appendix:

(1) Whenever the largest coefficient in the segment is at a frequency less
than or equal to 62.5 Hz, and the second largest coefficient is at a frequency less than or
equal to 300 Hz or higher than 1,000 Hz;

(2) Whenever the largest coefficient in the segment is in a frequency range
between 62.5 Hz and 250 Hz, and the second coefficient is at a frequency less than 200
Hz.

The following situations characterize the noise-only case according to
implementation encp6.m, given in the Appendix:

(1) The largest coefficient in the segment is at a frequency less than or 
equal to 125 Hz, and the second largest coefficient is at a frequency less than 300 Hz; 

(2) The largest coefficient is at a frequency less than 62.5 Hz, and the 
second coefficient is at a frequency higher than 1,000 Hz; 

(3) The largest coefficient is at a frequency less than 500 Hz for the female 
speaker, or less than 1,000 Hz for the male speaker, and the second coefficient is at a 
frequency less than 125 Hz. 

All remaining cases are considered as noisy speech. For those cases, all 
coefficients located at frequencies below or equal to 62.5 Hz are zeroed out. 
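The ndencomp.m rules listed above can be written as a small decision function. The Python sketch below is an interpretation, not a transcription of the Appendix code: the function names are hypothetical, the range in rule (2) is assumed to exclude 62.5 Hz and include 250 Hz, and noise-only windows are assumed to be discarded by zeroing all of their coefficients.

```python
def is_noise_only(f1, f2):
    """f1, f2: frequencies in Hz of the largest and second-largest coefficients."""
    # Rule (1): largest at or below 62.5 Hz, second at or below 300 Hz or above 1,000 Hz.
    if f1 <= 62.5 and (f2 <= 300.0 or f2 > 1000.0):
        return True
    # Rule (2): largest between 62.5 Hz and 250 Hz, second below 200 Hz.
    if 62.5 < f1 <= 250.0 and f2 < 200.0:
        return True
    return False

def denoise_window(coeffs, fs=8000.0):
    """Zero a noise-only window; otherwise zero only the bins at or below 62.5 Hz."""
    n = len(coeffs)
    freqs = [(k + 0.5) * fs / (2.0 * n) for k in range(n)]  # DCT-IV bin centres
    order = sorted(range(n), key=lambda k: abs(coeffs[k]), reverse=True)
    if is_noise_only(freqs[order[0]], freqs[order[1]]):
        return [0.0] * n
    return [0.0 if f <= 62.5 else c for f, c in zip(freqs, coeffs)]
```

The encp6.m variant would differ only in the thresholds used inside `is_noise_only`, including the speaker-dependent limit in its third rule.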

Three specific noise-and-speech cases are presented as follows:

(1) Noisy speech-Case 1. An example of this case is the word “hey,”
where the sound /h/ was lost in the background noise. Due to the higher frequency
content of /h/ (as opposed to the noise), it was possible to identify and pick up one
more CP coefficient per interval, thus retrieving the sound of /h/. This example is
illustrated in Figures 6.10 and 6.11, which show time plots and spectrograms that
correspond to keeping 1 and 2 CP coefficients per 16-ms interval, respectively.

(2) Noisy speech-Case 2. This problem required differentiation of the
noise-only case from the noisy voiced stops /b/ and /p/. Distinguishing these sounds from
noise was easier than in Case 1 above, since the largest coefficient obtained for those
two phonemes was never at a frequency below 250 Hz, making it possible to denoise without
interfering with those sounds.

(3) Noisy speech-Case 3. There were difficulties in separating the weak
ending /s/, such as in “cats” and “let’s,” from the background noise. Whenever this case
occurred, the Best Basis Algorithm produced a 32 ms time window with the first two
largest coefficients at frequencies less than 125 Hz. Although the phoneme /s/ is
located at frequencies higher than 125 Hz, its energy was too small to be differentiated
from that of the noise. Thus, the data contained in the phoneme /s/ is identified as noise
only and disregarded before the compression step. Figure 6.12 illustrates this case.

Figure 6.2 Sentence “This Place Blows,” male native speaker; top plot: Short-time
energy, zero-crossing representation; bottom plot: Time domain representation, fs = 8
kHz

Figure 6.3 Sentence “This Place Blows,” male native speaker, “compcp” implementation;
(a) Time domain plot; (b) Frequency behavior plot according to Table 6.1; (c) Energy
plot; (d) Spectrogram, using a Hanning time window of length 256 samples and
overlapping of 128 samples between adjacent windows, fs = 8 kHz

Figure 6.4 Sentence “This Place Blows,” male native speaker, “compcp” implementation;
(a) Time domain plot; (b) Voiced-unvoiced segmentation; (c) Spectrogram, using a
Hanning time window of length 256 samples and overlapping of 128 samples between
adjacent windows, fs = 8 kHz

Figure 6.5 Sentence “Be Nice to Your Sister,” female native speaker; top plot: Short-time
energy and zero-crossing; bottom plot: Time domain plot

Figure 6.6 Sentence “Be Nice to Your Sister,” female native speaker, “compcp”
implementation; (a) Time domain plot; (b) Frequency behavior plot, according to Table
6.1; (c) Energy plot; (d) Spectrogram, using a Hanning time window of length 256
samples and overlapping of 128 samples between adjacent windows, fs = 8 kHz

Figure 6.7 Sentence “Be Nice to Your Sister,” female native speaker, “compcp”
implementation; (a) Time domain plot; (b) Voiced-unvoiced segmentation;
(c) Spectrogram, using a Hanning time window of length 256 samples and overlapping
of 128 samples between adjacent windows, fs = 8 kHz


Figure 6.8 Word “Nice,” female native speaker, “ndencomp” implementation, fixed
thresholding with 1% of coefficients kept after compression; (a) Original time plot;
(b) Spectrogram of original speech signal; (c) Plot after fixed thresholding selection
of coefficients is applied; (d) Spectrogram of processed signal (both spectrograms use a
Hanning time window of length 256 samples and overlapping of 128 samples between
adjacent windows, fs = 8 kHz)


Figure 6.9 Word “Nice,” female native speaker, “ndencomp” implementation, adaptive
thresholding, with an average of 0.98% CP coefficients kept for compression; (a) Original
time domain plot; (b) Spectrogram of original speech signal; (c) Time domain plot of
processed signal; (d) Spectrogram of processed signal (both spectrograms use a Hanning
time window of length 256 samples and overlapping of 128 samples between adjacent
windows, fs = 8 kHz)

Figure 6.10 Word “Hey,” male non-native speaker, “ndencomp” implementation (/h/ lost
after denoising scheme when it is identified as noise only); (a) Original time domain plot;
(b) Spectrogram of original signal; (c) Time plot after denoising/compression scheme;
(d) Spectrogram after denoising/compression scheme (both spectrograms use a Hanning
time window of length 256 samples and overlapping of 128 samples between adjacent
windows, fs = 8 kHz)

Figure 6.11 Word “Hey,” male non-native speaker, “ndencomp” implementation (/h/
recovered after denoising scheme when it is identified as noisy speech); (a) Original
time domain plot; (b) Spectrogram after denoising/compression scheme; (c) Time plot
after denoising/compression scheme (both spectrograms use a Hanning time window of
length 256 and overlapping of 128 samples between adjacent windows, fs = 8 kHz)

Figure 6.12 Word “Cats,” female non-native speaker, “ndencomp” implementation (/s/
lost after denoising scheme when it is identified as noise only); (a) Original time domain
plot; (b) Spectrogram of original speech; (c) Time domain plot after
denoising/compression; (d) Spectrogram after denoising/compression (both spectrograms
use a Hanning time window of length 256 and overlapping of 128 samples between adjacent
windows, fs = 8 kHz)

VII. ENCODING SCHEMES


This chapter is divided into three main sections. The first proposes a quantization 
scheme to transmit the CP coefficients. The second proposes encoding schemes to 
transmit the side information, i.e., both the locations and the initial indexes of each 
segment. The third section presents the coding scheme used to transmit the coefficients 
vector, the locations vector and the vector containing the initial locations of each 
segment. 

A. THE QUANTIZATION SCHEME 

Once data is available for transmission, the user must quantize and code it. Once
the compression scheme has proved to be efficient and to allow a good quality
reconstruction, consideration is given to finding a uniform quantizer that can efficiently
reproduce the coefficients to be transmitted [14].

Three different vectors must be sent for speech compression. The first vector 
contains the cosine packet coefficients. The second contains the location of the 
coefficients. The third vector contains the initial time locations of each segment. To 
transmit the first vector, i.e., the coefficients vector, the following is done: 

(1) The data are normalized by dividing all the vectors by the maximum 
absolute value of all the coefficients. This value turns out to be the scaling factor; 

(2) The whole vector is multiplied by QL/2 (where QL is the number of 
quantizing levels selected by the user), and rounded to the closest integer. 

By performing these steps, a QL-level quantizer is built. It has QL levels due to
the normalization and further multiplication by QL/2, which assures that the positive and
negative parts of speech will always be between -QL/2 and +QL/2.

(3) The scaling factor, equal to the maximum absolute value of all the
coefficients, is sent. In the receiver the following steps are to be performed:

(a) Upon receiving the vector, divide it by QL/2, recovering the
rounded normalized coefficients vector;

(b) Use the scaling factor to recover the amplitudes of the original
coefficients. Even without sending the scaling factor, it was possible to recover the
coefficients and thus reconstruct the data. The only difference is that the data was scaled
in amplitude by a constant factor.
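The transmitter and receiver steps above can be summarized in a short sketch. The Python code below is illustrative and is not the quantx.m function from the Appendix; the function names are hypothetical, and ties in the rounding step follow Python's built-in `round`, which the thesis does not specify.

```python
def quantize(coeffs, ql):
    """Normalize by the largest magnitude, scale by QL/2, round to integers.

    Returns (quantized integers, scaling factor).
    """
    scale = max(abs(c) for c in coeffs)
    if scale == 0.0:
        return [0] * len(coeffs), 0.0
    half = ql / 2.0
    return [int(round(c / scale * half)) for c in coeffs], scale

def dequantize(q, ql, scale):
    """Receiver side: divide by QL/2, then restore amplitudes with the scaling factor."""
    half = ql / 2.0
    return [v / half * scale for v in q]
```

Calling `dequantize` with `scale=1.0` corresponds to the case mentioned above in which the scaling factor is not sent: the signal is recovered up to a constant gain.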

B. PROPOSED ENCODING SCHEMES 
1. Cosine packet coefficient locations 

To transmit the second vector, i.e., the locations vector, the user first must find
the least cost means of transmission. The following example shows a typical
locations vector L:

L = [ 1806 1807 1841 1842 1847 1930 1934 1935 2020 2021 2062 2147 2148 2188 2192
2193 2274 2318 2320 2322 2328 2406 2413 2414 2510 ...].

Note that there are small differences between some values in this sequence, while larger
jumps take place less often. This is because the small differences occur within the same
segment, and the larger differences indicate a change from one segment to the adjacent
one. Thus, the differences between successive locations are encoded, since they require a
smaller number of bits. The differential locations vector corresponding to the locations
vector above is given by the vector DL below:

DL = [ 1806 1 34 1 5 83 4 1 85 1 41 85 1 40 4 1 81 44 2 2 6 78 7 1 96 ...].

As a result of sending the differences, it is also necessary to send the
value for the first location, to allow for an exact reconstruction of the coefficient
locations during the decoding process.
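The differential encoding of the locations vector and its inverse can be sketched as follows. This Python sketch is illustrative; the function names are hypothetical. The example uses the vectors quoted above, with the fourteenth location taken as 2188, the value implied by DL.

```python
from itertools import accumulate

def encode_locations(locations):
    """First location followed by the differences between successive locations."""
    return [locations[0]] + [b - a for a, b in zip(locations, locations[1:])]

def decode_locations(diffs):
    """Running sum of the differential vector recovers the original locations."""
    return list(accumulate(diffs))

L = [1806, 1807, 1841, 1842, 1847, 1930, 1934, 1935, 2020, 2021, 2062, 2147,
     2148, 2188, 2192, 2193, 2274, 2318, 2320, 2322, 2328, 2406, 2413, 2414, 2510]
DL = encode_locations(L)
RL = decode_locations(DL)  # reconstructed locations at the receiver
```

Most entries of DL fit in seven bits or fewer, whereas the raw locations in an 8192-sample signal need up to thirteen.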

2. Segment Indexes

The third vector to transmit is the vector that corresponds to the indexes of each 
segment. The Best Basis Algorithm selects the basis by searching for the minimum 
entropy representation. When the length of each new window is obtained, the algorithm 
outputs the two parameters “b” and “d,” which allow the beginning index of the next time 
window to be computed. The expression for obtaining index “i” is as follows: 



(7.1) 


where n is the original length of each window. Since the parameters “b” and “d” are small
numbers, composed of one or two digits and, therefore, much smaller than the indexes
themselves, it is a good idea to transmit the parameters instead of the indexes. Thus, the
two vectors “nde” and “nbe,” which are composed of the parameters “d” and “b” of each
time window, respectively, are transmitted. For example, suppose the vectors nde and nbe
are given as follows:


nde = [ 4 6 6 5 5 6 6 5 6 6 5 5 4 4 2 3 6 6 5 ].

nbe = [ 0 4 5 3 4 10 11 6 14 15 9 10 11 6 7 2 6 56 57 ]. 

Considering n = 8192 time samples, the corresponding vector I containing the 
initial locations of the first eight segments is given by : 

I = [ 1 512 640 768 1024 1280 1408 1536...]. 
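Equation (7.1) is not legible in this copy, so the following Python sketch rests on an assumption: that “d” is the depth of a window and “b” its block index among the 2^d equal blocks of the n-sample signal, so that the zero-based starting sample is b·n/2^d. The function name is hypothetical. Under this assumption the first eight entries of nde and nbe reproduce the vector I above, except that the thesis lists the first location as 1 (one-based indexing).

```python
def segment_starts(nde, nbe, n):
    """Zero-based first sample of each window: block b of the 2**d blocks of n samples."""
    return [b * n // 2 ** d for d, b in zip(nde, nbe)]

nde = [4, 6, 6, 5, 5, 6, 6, 5]
nbe = [0, 4, 5, 3, 4, 10, 11, 6]
starts = segment_starts(nde, nbe, 8192)
```

The window length follows from the same parameters as n/2^d, so the pair (d, b) is enough for the receiver to place each segment.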

To reconstruct the locations vector of the non-zero coefficients, the receiver works 
on the received vector of differential locations DL and reconstructs L. The reconstructed 
vector is then called RL. 

Once the locations of non zero coefficients (vector RL) are available, along with 
the locations of the beginning of each new segment (vector I), the receiver will be able to 
apply the DCT transform to reconstruct the speech signal. 

C. CODING SCHEMES

After quantization, the coefficients vector is encoded using Huffman Coding
[14], which minimizes the total number of bits by assigning more bits to less frequent
symbols and fewer bits to more frequent ones. The vectors nde and nbe are combined
into a single vector and passed through the Huffman Coder. The inputs include the
number of symbols and the probabilities of each one, whereas the outputs from the
Huffman Coder are the code words and the average length of the symbols. In order to
perform the quantization step and also compute the probabilities of occurrence of each
symbol to be coded, the function quantx.m, given in the Appendix, was implemented.
That function receives the original vector and the number of levels desired for
quantization, and returns the quantized vector and the probabilities in descending order, as
required by the Huffman Coder (the Huffman Coder used is given in the Appendix). The code
was adapted as a function to be called whenever this step is necessary. Finally, the exact
number of bits necessary to encode the differential locations vector (DL) is computed.
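For illustration, a Huffman coder with the inputs and outputs described above (symbol probabilities in, code words and average length out) can be sketched in Python. This is a generic textbook construction using the standard `heapq` module, not the coder given in the Appendix; the function names are hypothetical.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Map each symbol to a bit string; probabilities is a dict symbol -> p."""
    if len(probabilities) == 1:
        return {next(iter(probabilities)): "0"}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their code words.
        p0, _, c0 = heapq.heappop(heap)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

def average_length(probabilities, code):
    """Expected number of bits per symbol."""
    return sum(p * len(code[s]) for s, p in probabilities.items())
```

The average length multiplied by the number of transmitted symbols gives the bit count for the coefficients vector or for the combined nde/nbe vector.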

VIII. TESTS AND RESULTS


A. INTRODUCTION 

This chapter describes the procedures that are used to test the compression and
encoding schemes. First, the basic compression scheme results are presented. Next, the
combined denoising/compression schemes are given. Then, the performances of the encoding
schemes, which are used to transmit the compressed information, are presented. Finally,
the Cosine Packet compression scheme performances are compared with those obtained using
the related Wavelet Packet Transform.

B. COMPRESSION SCHEME RESULTS 

The compression-only scheme is first applied to “clean” speech to evaluate its
performance. To test this scheme on isolated words, we use the words “project,”
“cataratas,” and the segment “encyclope,” extracted from the word “encyclopedia.” This
compression scheme is also applied to the following two sentences:

“ Be nice to your sister,” spoken by a female native speaker; and 
“ This place blows,” spoken by a male native speaker. 

1. Description 

The testing software requires the user to input the following: 

(1) The gender of the speaker. This information is required since the pitch 
for a female speaker occurs at a higher frequency than that of a male speaker; 

(2) Word or sentence to be compressed; 

(3) Maximum depth used for the cosine packet time splitting, which in
turn fixes the minimum size of the window.

The following outputs are provided:

(1) The mean square error between the original and the reconstructed
speech signal;

(2) The number of non-zero cosine packet coefficients in the original
signal (ONCOEF);

(3) The number of non-zero cosine packet coefficients selected by the
compression scheme (FNCOEF);

(4) The reconstructed speech signal obtained after compression.

Two different compression implementations were considered, which differ in the 
number of cosine packet coefficients kept to compress the speech signal. The first 
compression scheme (implemented in compcp.m, given in the Appendix) selects the 
cosine packet coefficients as follows: 

(1) Keep the top 0.5% non-zero coefficients (rounded to the closest 
integer) in each time window when the speech segment is detected as unvoiced; this 
percentage means selecting one coefficient out of every interval containing 128 
coefficients, one coefficient out of every interval containing 256 coefficients, and so on, 
according to the result of the rounding process; 

(2) Keep the top 1.3% non-zero coefficients (rounded to the closest 
integer) for each time window of minimum length (16 ms) when the speech segment is 
identified as voiced; this means selecting two coefficients out of every 128 coefficients, 
three coefficients out of every interval containing 256 coefficients, and so on; 

(3) Keep the top 2.34% non-zero coefficients (rounded to the closest
integer) for each time window larger than 16 ms when the speech segment is identified as
voiced; this means selecting three coefficients out of every interval containing 128
coefficients, six coefficients out of every interval containing 256 coefficients, and so on.
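The three rules above reduce to a percentage lookup followed by rounding. The following Python sketch is illustrative and is not taken from compcp.m; the function name and the `min_len` parameter (128 samples, i.e., 16 ms at 8 kHz) are assumptions.

```python
def coefficients_to_keep(window_len, voiced, min_len=128):
    """Number of coefficients kept per window under the first (compcp) scheme."""
    if not voiced:
        fraction = 0.005       # rule (1): unvoiced segments
    elif window_len <= min_len:
        fraction = 0.013       # rule (2): voiced, minimum-length (16 ms) window
    else:
        fraction = 0.0234      # rule (3): voiced, window longer than 16 ms
    return int(window_len * fraction + 0.5)  # round to the closest integer
```

For example, an unvoiced window of 128 or 256 coefficients keeps one coefficient, a voiced 128-coefficient window keeps two, and a voiced 256-coefficient window keeps six.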

The second compression scheme (implemented in necompcp.m and given in the 
Appendix) uses the following schemes to compress the speech signal: 

(1) Keep the top 0.5% non-zero coefficients (rounded to the closest 
integer) in each time window when the speech segment is unvoiced; this means selecting 
one coefficient out of every interval containing 128 coefficients, one coefficient out of 
every interval containing 256 coefficients, and so on;

(2) Keep the top 1.3% non-zero coefficients (rounded to the closest
integer) for each time window when the speech segment is identified as voiced; this 
means selecting two coefficients out of every interval containing 128 coefficients, three 
coefficients out of every interval containing 256 coefficients, and so on. 

2. Experimental Results 

Results obtained for the two compression schemes are presented in Table 8.2. In 
Chapter VI, Figures 6.8 and 6.9 present time domain plots and spectrograms for the 
Adaptive Thresholding compression scheme considered in this section. The parameters 
used to measure degradation due to the compression scheme are: 

(1) The mean square error (MSE) between the original and the reconstructed 
speech signal; 

(2) A subjective evaluation made by five different users of the quality of the 
reconstructed signal when compared to the quality of the original signal. The evaluation 
was graded on a scale from 1 to 5, according to Table 8.1. 


Grade   Speech Quality   Level of Distortion
5       Excellent        Imperceptible
4       Good             Just perceptible but not annoying
3       Fair             Perceptible and slightly annoying
2       Poor             Annoying but not objectionable
1       Unsatisfactory   Very annoying and objectionable

Table 8.1 Mean opinion score table


(3) The ratio between the number of non-zero cosine-packet coefficients kept after
compression and the total number of initial non-zero coefficients obtained with the cosine
packet decomposition (FNCOEF/ONCOEF, in %).
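The two objective measures can be computed as in the short Python sketch below. The function names are hypothetical, and the plain (unnormalized) mean of squared differences is an assumption about how the MSE was defined. The percentage is the number of coefficients kept divided by the original number; for example, 124 kept out of 8192 gives 1.51%.

```python
def mean_square_error(original, reconstructed):
    """Mean of squared sample differences between two equal-length signals."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def percent_kept(fncoef, oncoef):
    """Percentage of non-zero coefficients kept after compression."""
    return 100.0 * fncoef / oncoef
```

A smaller percentage corresponds to a higher compression ratio, so the two measures have to be read together with the subjective grades.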

3. Comments

Note the slightly higher speech quality mean grades assigned to code compcp.m, 
which also presents a slightly higher percentage of coefficients kept (i.e., a lower 
compression ratio). Experiments showed that fixed thresholding selects 1% of the set of 
coefficients, and leads to the distortion of voiced phonemes (e.g., /i/ ends up sounding 
like IvJ in the word “nice”). The “after compression” spectrogram included in the bottom 
right of Figure 6.8 showed that the higher formant section of the phoneme /i/ has not been 
preserved in the compression. By comparison, Figure 6.9 showed the results obtained 
using an adaptive thresholding scheme, which selects more coefficients for the voiced 
segments while keeping a smaller total percentage of coefficients (0.98%). The after- 
compression spectrogram shown in Figure 8.2 shows that the high formants of the 
phoneme /i/ are better preserved, leading to a better reconstruction of the voiced 
phoneme. 
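
The contrast between fixed and adaptive selection can be illustrated with a toy example. The sketch below is written in Python and is not the thesis implementation: the coefficient values, the voiced/unvoiced labels, and the per-interval counts (k_voiced, k_unvoiced) are all hypothetical. It only shows how giving more coefficients to voiced intervals can still lower the total number kept.

```python
def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def count_nonzero(intervals):
    return sum(1 for interval in intervals for c in interval if c != 0.0)

def fixed_select(intervals, k):
    """Same number of coefficients kept in every interval."""
    return [keep_largest(interval, k) for interval in intervals]

def adaptive_select(intervals, voiced, k_voiced, k_unvoiced):
    """More coefficients for voiced intervals, fewer for unvoiced ones."""
    return [keep_largest(interval, k_voiced if is_voiced else k_unvoiced)
            for interval, is_voiced in zip(intervals, voiced)]

# Toy data: one voiced interval followed by two unvoiced ones (made-up values).
intervals = [
    [0.9, -0.7, 0.5, 0.3, 0.1, 0.05, 0.02, 0.01],
    [0.2, 0.1, -0.05, 0.04, 0.03, 0.02, 0.01, 0.005],
    [0.15, -0.1, 0.06, 0.04, 0.03, 0.02, 0.01, 0.005],
]
voiced = [True, False, False]

fixed = fixed_select(intervals, k=2)
adaptive = adaptive_select(intervals, voiced, k_voiced=3, k_unvoiced=1)
print(count_nonzero(fixed), count_nonzero(adaptive))
```

Here the fixed rule keeps six coefficients in total, while the adaptive rule keeps five and still retains three in the voiced interval.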




Parameters                    “Project”  “Cataratas”  “Encyclope”  “Issos”  “Assos”  “Be Nice”  “This Place”

Code: NECOMP.M
MSE                           0.0045     0.0315       0.0127       0.0074   0.0136   0.0070     0.0432
ONCOEF                        8192       8192         8192         8192     8192     16384      16384
FNCOEF                        124        100          104          100      100      210        216
% ONCOEF/FNCOEF               1.51       1.22         1.27         1.22     1.22     1.28       1.32
Speech quality mean grade     2.2        2.4          2.8          2.8      2.2      3.2        3.0

Code: COMPCP.M
MSE                           0.0038     0.0313       0.0121       0.0057   0.0115   0.0052     0.0297
ONCOEF                        8192       8192         8192         8192     8192     16384      16384
 (original # of coeff. > 0)
FNCOEF                        181        109          126          129      115      139        228
 (final # of coeff. > 0)
% ONCOEF/FNCOEF               2.21       1.33         1.54         1.57     1.40     0.85       1.39
Speech quality mean grade     2.6        2.6          2.8          3.0      2.2      3.4        3.2


Table 8.2 Compression only results 

C. DENOISING-COMPRESSION RESULTS 


1. Description 

Next, we consider the application of a combined denoising and compression 
scheme designed to minimize the effects of narrowband equipment noise in a few isolated 
words and sentences. The isolated words used in these tests are: 

“Be”, spoken by a female and by a male speaker; 

“Cats”, spoken by a female speaker; 

“Hey”, spoken by a female and by a male speaker; 

“Met”, spoken by a female speaker; and 

“Pay”, spoken by a female speaker. 

The sentences used are: 

“Hello, my name is Roberto, today is Tuesday;” and 

“Bye, guys. I’m going back to Brazil”, both spoken by a male speaker. 

Two different implementations for the denoising scheme are considered: The first 
is implemented in ndencomp.m and the second is implemented in encp6.m (both are 
listed in the Appendix). The noise identification and denoising process for each 
implementation can be found in Chapter VI, Section E, and in the Appendix. Details 
regarding the compression scheme for both implementations can be found in the 
Appendix. Table 8.3 presents the compression results for tests applied on the same 
“clean” words of the previous section, but using the codes ndencomp.m and encp6.m. 

2. Results 

The parameters used to evaluate the denoising/compression scheme are identical 
to those defined for the compression-only scheme, with the exception of the mean square 
error (MSE). This parameter was omitted because the denoising step produced a greater 
difference between the original and the reconstructed signals. The performance results are 
presented in Table 8.4. 

Parameters                    “Project”  “Cataratas”  “Encyclope”  “Issos”  “Assos”  “Be Nice”  “This Place”

Code: NDENCOMP.M
ONCOEF                        8192       8192         8192         8192     8192     16384      16384
FNCOEF                        189        153          152          129      145      111        224
% ONCOEF/FNCOEF               2.31       1.87         1.86         1.87     1.77     1.69       1.37
Speech quality mean grade     2.5        3.0          3.0          3.2      2.6      3.3        3.2

Code: ENCP6.M
ONCOEF                        8192       8192         8192         8192     8192     16384      16384
FNCOEF                        188        149          152          129      143      270        264
% ONCOEF/FNCOEF               2.29       1.82         1.86         1.86     1.75     1.65       1.61
Speech quality mean grade     2.5        3.3          3.2          3.4      2.8      3.5        3.4


Table 8.3 Compression results utilizing codes ndencomp.m and encp6.m 

The speech quality mean grade was computed for the following speech data: the 
words “Be” (female speaker) and “Pay” (male speaker), and the sentences “Hello, my name 
is Roberto ...” and “Bye, guys. I’m going back ...” These results are presented in Table 
8.5. 

Parameters         “Be”     “Be”    “Cats”   “Hey”    “Hey”   “Met”    “Pay”    “Pay”   “Hello, my   “Bye,
                   Female   Male    Female   Female   Male    Female   Female   Male    name ...”    guys ...”

Code: NDENCOMP.M
ONCOEF             6270     5248    8190     8188     9342    8190     7296     7294    32768        24574
FNCOEF             59       41      143      64       42      37       49       46      569          318
% ONCOEF/FNCOEF    0.94     0.78    1.75     0.78     0.44    0.45     0.67     0.63    1.74         1.29

Code: ENCP6.M
ONCOEF             6272     5248    8192     8192     9344    8192     7296     7296    32768        24576
FNCOEF             72       48      27       65       35      40       55       34      573          348
% picked           1.15     0.91    0.33     0.79     0.37    0.49     0.75     0.47    1.75         1.42


Table 8.4 Denoising/compression results 


SPEECH                        ndencomp   encp6

“Be” (female speaker)         2.2        3.4
“Pay” (male speaker)          2.6        2.6
“Hello, my name is ...”       3.2        3.2
“Bye, guys ...”               2.4        2.6


Table 8.5 Speech quality mean grades 

3. Comments 

We note that the overall speech quality mean grades in Table 8.3 are slightly 
higher than those in Table 8.2. We also note that the ONCOEF/FNCOEF percentages 
in Table 8.4 are small, since large sections of data are identified as noise-only by the 
denoising step and are therefore not retained for compression. Results obtained for both 
denoising/compression schemes show slightly better speech quality for the encp6.m 
implementation than for the ndencomp.m implementation. 

a. Word “be” 

Both denoising/compression schemes produce good results for the word “be” 
for male and female speakers. The plots in Figure 8.1 show the efficiency of the 
algorithm in both the time and frequency domain. The quality of the reconstructed speech 
is good, as illustrated by the grades assigned by five native listeners. 

b. Word “Hey” 

For male and female cases, both denoising/compression schemes produce 
good results (Figures 8.2 and 8.3). The quality of the reconstructed speech is high, as 
confirmed by the listening tests. Note that the /h/ sound in the female speech has a higher 
frequency than that of the male voice. The denoising schemes also allow the phoneme /h/ 
to be differentiated from the noisy background environment. 

d. Word “met” 

Both denoising/compression schemes produce a good reconstruction of 
“/me/” and a poor reconstruction of the phoneme /t/, which is reconstructed sounding like 
a “/d/.” This degradation is due to the combination of too few coefficients kept for 
compression in this section of the word and a noisy background. 

e. Word “pay” 

The reconstructed sound quality is better for the male case (Figures 8.4 and 
8.5). Again, the degradation is due to too few coefficients being kept. The sound /p/ has its 
two largest CP coefficients below 400 Hz and around 1,000 Hz, respectively, and most of its 
energy is concentrated in the lower frequency coefficients. When the word is spoken by a 
female, the higher frequency coefficients carry less energy than less important coefficients 
from the lower frequency portion. For that reason, fewer coefficients from the higher 
frequency portion are kept, leading to a poorer sound than the male version. 

f. Sentence “Hello, my name is Roberto, today is Tuesday” 

The spectrograms in Figure 8.6 show that the main signal energy is preserved, 
and that denoising occurs in the correct time intervals. 

g. Sentence “Bye, guys, I’m going back to Brazil” 

For this sentence, both denoising-plus-compression schemes result in a good 
reconstruction. It is possible to observe in the spectrograms of Figure 8.7 that the 
algorithm picks up the important cosine packet coefficients. In this case, no significant 
amount of denoising was done due to the high quality of the original speech. However, it 
is worth comparing the effects of the denoising schemes. Note that, with ndencomp.m, 
the resulting signal is broken up by more noisy intervals than with encp6.m. In the 
listening tests for both sentences, the mean grade assigned to the reconstruction using 
ndencomp.m is better than the one assigned when using encp6.m. Basically, the unvoiced 
sounds had a better reconstruction using the former code, whereas the latter code 
produced some distortion, leading to what was called a mechanical sound by some 
listeners. 

D. ENCODING SCHEMES RESULTS 


1. Description 

The data used for these tests consist of twelve speech sequences of lengths 8192 
(ten words and/or sounds), 32768 (one sentence) and 65536 (one dialogue). The software 
used for this set of tests includes voiced-unvoiced segmentation, denoising, compression, 
and encoding steps. Both denoising/compression schemes are used in these tests. The 
coding software is presented in the Appendix. The minimum window size is 16 
milliseconds. The performance of the encoding scheme is evaluated with the compression 
ratio, which compares the total number of bits after encoding with the total original 
number of bits. The original number of bits is computed by multiplying the number of bits 
used to represent each incoming sample (the samples had 8 bits and were PCM compressed) 
by the original number of samples. For example, for each of the ten sequences of length 
8192, the original number of bits is 8192 · 8 = 65,536 bits per speech sequence. The 
following speech sequences are used in our tests: 

(a) “BE,” spoken by a female speaker; 

(b) “HEY,” spoken by a female speaker; 

(c) “MET,” spoken by a female speaker; 

(d) “PAY,” spoken by a female speaker; 

(e) “CATS,” spoken by a female speaker; 

(f) Word “PROJECT,” spoken by a male speaker; 

(g) Word “CATARATAS,” spoken by a male speaker; 

(h) Sound or partial word “ENCYCLOPE,” spoken by a male speaker; 

(i) Sound “ASSOS,” spoken by a male speaker; 

(j) Sound “ISSOS,” spoken by a male speaker; 

(k) Sentence “Bye guys. I’m going back to Brazil,” male speaker; 

(l) Dialogue from a telephone conversation, male and female speakers. 
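
The sizes quoted in this description can be checked with a few lines of arithmetic. The Python sketch below is illustrative only; it assumes that the 8 kHz sampling rate given in the figure captions also applies to these sequences, and the function names are hypothetical.

```python
FS_HZ = 8000          # assumed sampling rate (fs = 8 kHz in the figure captions)
BITS_PER_SAMPLE = 8   # 8-bit PCM samples, as stated in the description

def original_bits(num_samples, bits_per_sample=BITS_PER_SAMPLE):
    """Number of bits in the uncompressed sequence."""
    return num_samples * bits_per_sample

def duration_seconds(num_samples, fs_hz=FS_HZ):
    """Duration of a sequence at the assumed sampling rate."""
    return num_samples / fs_hz

def window_samples(window_ms, fs_hz=FS_HZ):
    """Length in samples of a window given in milliseconds."""
    return int(round(window_ms * fs_hz / 1000))

for n in (8192, 32768, 65536):
    print(n, original_bits(n), duration_seconds(n))
print(window_samples(16))
```

Under these assumptions a length-8192 sequence holds 65,536 bits and lasts about one second, and the 16 ms minimum window spans 128 samples.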

2. Results 


A measure of distortion is obtained by comparing the quality of the reconstructed 
signal with that of the original speech signal (“Speech Quality” in Table 8.1). 

To evaluate the efficiency of the encoding scheme, the following 
parameters are chosen: 

(1) The COMPRATIO, defined as one minus the ratio between the total 
number of bits after compression and the total number of bits in the original signal; 

(2) The mean square value of the quantization error. 

The performance results for the encoding scheme are presented in Table 8.6. 
All results are based on the denoising/compression implementation ndencomp.m, except 
for the words “hey” and “met,” which use encp6.m. 


SPEECH               SPEECH QUAL   COMPRATIO %   MSE

“Be”                 2.6           98.70%        5.12e-?
“Hey”                3.2           98.56%        9.32e-?
“Met”                3.2           98.85%        5.93e-?
“Pay”                3.0           99.17%        5.22e-?
“Cats”               3.4           98.59%        1.01e-?
“Project”            3.2           97.87%        1.06e-?
“Cataratas”          3.4           97.65%        4.61e-?
“Encyclope”          3.8           97.64%        2.86e-?
“Assos”              3.2           97.87%        3.62e-?
“Issos”              4.0           98.05%        2.16e-?
“Bye, guys...”       2.4           98.10%        2.707e-?
Tel. conversation    2.6           98.06%        2.843e-?


Table 8.6 Encoding results, 64-level quantizer. 

The mean speech quality grade assigned is 3.2 (see the MOS scale in Table 8.1), which 
corresponds to “fair” quality, i.e., perceptible and slightly annoying distortion. We also 
note the high compression ratios and the very small values of MSE, which correspond to 
the mean square quantization error. 

The compression ratio is calculated in the following way. Eight bits are used for 
each original sample of data, since that is the number used to load speech recordings into 
MATLAB. The total number of bits after encoding is computed by multiplying the average 
number of bits per symbol given by the Huffman coder by the corresponding number of 
samples, both for the coefficients vector and for the three vectors used to transmit the 
locations and the window boundaries. The ratio between this final total number of bits 
and the original total number of bits is then computed, and COMPRATIO is one minus 
this ratio, as defined above. Comparing the percentages from this encoding table to the 
ones from the previous sections (compression and denoising/compression), we note that, 
although still very low, the fraction of the original bits remaining after the encoding 
process (~2%) is higher than the fraction of coefficients retained in the earlier tables 
(~0.85%). The reason is that, in addition to the cosine packet coefficients, the side 
information (i.e., the locations of those coefficients) must also be encoded. Thus, even 
though the number of bits is reduced by the quantization process, the additional 
information to be transmitted makes the number a little higher. 
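
The bit accounting described in this paragraph can be sketched as follows. The Python code is only an illustration: the function names are hypothetical, and the average code lengths and symbol counts are invented placeholders rather than measured values from the thesis.

```python
def encoded_bits(streams):
    """Total bits after Huffman coding.

    streams: list of (average_bits_per_symbol, number_of_symbols) pairs,
    one per coded vector (coefficients plus side-information vectors).
    """
    return sum(avg_bits * count for avg_bits, count in streams)

def compratio_percent(total_encoded_bits, original_bits):
    """COMPRATIO as defined in this section: one minus encoded/original, in percent."""
    return 100.0 * (1.0 - total_encoded_bits / original_bits)

# Hypothetical figures for one length-8192 word (NOT measured values).
streams = [
    (4.0, 120),   # quantized cosine packet coefficients
    (6.0, 120),   # side-information vector 1 (e.g., coefficient locations)
    (3.0, 20),    # side-information vector 2
    (3.0, 20),    # side-information vector 3
]
total = encoded_bits(streams)
original = 8192 * 8   # 8 bits per original sample
print(total, round(compratio_percent(total, original), 2))
```

With these placeholder numbers the side information accounts for more bits than the coefficients themselves, which is the effect the paragraph describes.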

As can be noted from the grades assigned, the encoding process results are good. 
The only problem concerns the low-energy coefficients corresponding to unvoiced sounds, 
which can be lost in the quantization and rounding processes. Figure 8.8 shows the word 
“project,” which lost its weak final /kt/. Even when the quantizer is changed to 32 or 64 
levels, it is still impossible to recover the final sound. 

Figure 8.9 presents the sentence “Be nice to your sister,” using a 16-level 
quantizer. We note that the sounds /s/ in “nice,” /t/ in “to,” and /r/ in “your” are lost. 
However, when the quantizer is changed to 32 levels, the main parts of these sounds are 
recovered (Figure 8.10). Finally, when the number of levels is increased to 64 (Figure 
8.11), the sequence is fully reconstructed, and practically no difference is noted 
between the original and reconstructed sounds. 
Similar results are observed for the sentence, “Bye guys, I’m going back to 
Brazil.” The phoneme /z/ from “guys” is lost with a 16-level quantizer (Figure 8.12), and 
is recovered with a 32-level quantizer (Figure 8.13). Similarly, when a 16-level quantizer 
was applied, the sound /h/ in the word “hey” was reconstructed like a /k/, resulting in a 
word sounding like “kay” (Figure 8.14). By changing to a 32-level quantizer, it was 
possible to recover the correct sound (Figure 8.15). The sound was even better with a 64- 
level quantizer (Figure 8.16). Note the sequential progress in the coefficients recovered in 
Figures 8.14 through 8.16, by comparing the plots (d) and (f). 
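
The loss of weak sounds with a coarse quantizer can be illustrated with a generic uniform quantizer. The Python sketch below assumes a mid-tread uniform quantizer over a symmetric range; the quantizer actually used in the thesis code may differ, and the coefficient values are made up.

```python
import math

def uniform_quantize(x, max_abs, levels):
    """Mid-tread uniform quantizer over [-max_abs, max_abs] (assumed design)."""
    step = 2.0 * max_abs / levels
    index = round(x / step)
    # Clip to the available indices: -levels/2 ... levels/2 - 1.
    limit = levels // 2
    index = max(-limit, min(limit - 1, index))
    return index * step

def bits_per_symbol(levels):
    """Fixed-length code size for a given number of levels."""
    return int(math.log2(levels))

weak, strong = 0.04, 0.9   # hypothetical low- and high-energy coefficients
for levels in (16, 32, 64):
    print(levels, bits_per_symbol(levels),
          uniform_quantize(weak, 1.0, levels),
          uniform_quantize(strong, 1.0, levels))
```

In this toy case the weak coefficient is rounded to zero with 16 levels and survives with 32 or 64 levels, while the strong coefficient is preserved at every setting.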

Two points are worth mentioning. First, when the number of quantizing levels is 
doubled, the compression ratio does not decrease significantly. For example, for the word 
“project” (an almost “clean” word with a high SNR), when the number of levels is 
increased from 16 to 32 (i.e., from 4 to 5 bits/symbol), the compression 
percentage changes from 98.36% (1:61.2) to 98.28% (1:58.4). Another example is the 
word “hey” (also with a high SNR): the compression percentages corresponding to the 
16-level, 32-level, and 64-level quantizers are 99.23% (1:130.4), 99.16% (1:119.2), and 
99.11% (1:113.2), respectively. Second, given this small cost, a 64-level quantizer is 
used as a good compromise between quality and compression. 
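
The two ways of quoting compression used here (a percentage and a 1:N ratio) are related by a simple formula. The Python sketch below, with hypothetical function names, converts between them; the quoted percentages agree with the quoted ratios to within rounding.

```python
def ratio_to_percent(n):
    """Compression 1:n expressed as the percentage of bits removed."""
    return 100.0 * (1.0 - 1.0 / n)

def percent_to_ratio(percent):
    """Inverse conversion: percentage of bits removed to the n of 1:n."""
    return 1.0 / (1.0 - percent / 100.0)

# Ratios quoted in the text for the word "hey" at 16, 32 and 64 levels.
for n in (130.4, 119.2, 113.2):
    print(n, round(ratio_to_percent(n), 2))
```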


E. COMPARISON WITH WAVELET PACKET TRANSFORM 

In this section, the Cosine Packet Transform (CPT) is compared to the Wavelet 
Packet Transform (WPT) as the basis of the compression procedure for clean (high SNR) 
speech. The results are compared only up to the compression stage, since the encoding 
scheme performs basically the same in both cases. 

The sentence “Be nice to your sister” is compressed using the Cosine Packet 
Transform, and the average percentage of non-zero coefficients selected from the original 
number equals 0.85% for a good reconstruction of the speech. A much poorer 
reconstruction results from the Wavelet Packet Transform using the “Daubechies(4)” 
wavelet basis function. The WPT is implemented with the WaveLab package [17], using 
the same criteria as those defined for the CPT implementation. 

The result obtained for this sentence can be analyzed through the time and 
frequency plots for both schemes given in Figures 8.17 and 8.18. We note that, in the 
Wavelet Packet Transform, there are “holes” in the time domain. We note also that those 
“holes” happen to be exactly at the intervals where the energy is lower, i.e., mainly at the 
unvoiced sounds. This is because the WPT scheme initially splits the signal into given 
frequency windows. In our example, only the largest 15% of the coefficients in the given 
frequency ranges are selected over the whole period of time. 

By comparison, the CPT splits the signal first in the time domain. Then, for each 
time frame, a thresholding is applied for the cosine packet coefficients. As a result, 
although many fewer coefficients are selected, there is no chance of having a time 
interval not represented. Actually, in this scheme, the holes are in the frequency domain. 
But, since the transform is good enough to detect the main frequencies contained in each 
locally stationary portion of the signal, the few cosine packet coefficients preserved at 
each time interval are sufficient to allow for a good reconstruction of the speech. These 
results confirm the theoretical expectation of superiority of the CPT over the WPT for 
speech signal compression applications. 
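
The difference between selecting coefficients over the whole signal and selecting them frame by frame can be shown with a toy example. The Python sketch below is not a transform implementation: the frames are made-up magnitudes standing in for coefficients, and it only illustrates why a per-frame rule cannot leave a time interval empty.

```python
def global_top_k(frames, k):
    """Keep the k largest-magnitude coefficients over the whole signal."""
    flat = [(abs(c), f, i) for f, frame in enumerate(frames) for i, c in enumerate(frame)]
    keep = {(f, i) for _, f, i in sorted(flat, reverse=True)[:k]}
    return [[c if (f, i) in keep else 0.0 for i, c in enumerate(frame)]
            for f, frame in enumerate(frames)]

def per_frame_top_k(frames, k):
    """Keep the k largest-magnitude coefficients inside every time frame."""
    result = []
    for frame in frames:
        order = sorted(range(len(frame)), key=lambda i: abs(frame[i]), reverse=True)
        keep = set(order[:k])
        result.append([c if i in keep else 0.0 for i, c in enumerate(frame)])
    return result

def empty_frames(frames):
    """Indices of frames in which no coefficient survived."""
    return [f for f, frame in enumerate(frames) if not any(frame)]

# Toy data: a quiet (unvoiced-like) frame between two loud ones.
frames = [[5.0, -4.0, 3.0, 0.5], [0.3, -0.2, 0.1, 0.05], [6.0, 2.5, -2.0, 0.4]]
print(empty_frames(global_top_k(frames, 3)))     # the quiet frame is dropped
print(empty_frames(per_frame_top_k(frames, 1)))  # every frame keeps one coefficient
```

Both calls keep three coefficients in total, but only the global rule leaves a "hole" at the low-energy frame.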

Figure 8.1 Word “Be,” male non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original plot; (c) Time domain plot 
after denoising/compression; (d) Spectrogram after denoising/compression (both 
spectrograms use a Hanning time window of length 256 samples and overlapping of 128 
samples between adjacent windows, fs = 8 kHz) 

Figure 8.2 Word “Hey,” male non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Time domain plot 
after denoising/compression; (d) Spectrogram after denoising/compression (both 
spectrograms use a Hanning time window of length 256 samples and overlapping of 128 
samples between adjacent windows, fs = 8 kHz) 

Figure 8.3 Word “Hey,” female non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Time domain plot 
after denoising/compression; (d) Spectrogram after denoising/compression (both 
spectrograms use a Hanning time window of length 256 samples and overlapping of 128 
samples between adjacent windows, fs = 8 kHz) 

Figure 8.4 Word “Pay,” female non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Time domain plot 
after denoising/compression; (d) Spectrogram after denoising/compression (both 
spectrograms use a Hanning time window of length 256 samples and overlapping of 128 
samples between adjacent windows, fs = 8 kHz) 

Figure 8.5 Word “Pay,” male non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Time domain plot 
after denoising/compression; (d) Spectrogram after denoising/compression (both 
spectrograms use a Hanning time window of length 256 samples and overlapping of 128 
samples between adjacent windows, fs = 8 kHz) 

Figure 8.6 Sentence “Hello, my name is Roberto, today is Tuesday,” male non-native 
speaker, “ndencomp” implementation; (a) Original time domain plot; (b) Spectrogram of 
original speech; (c) Time domain plot after denoising/compression; (d) Spectrogram after 
denoising/compression (both spectrograms use a Hanning time window of length 256 
samples and overlapping of 128 samples between adjacent windows, fs = 8 kHz) 

Figure 8.7 Sentence “Bye, guys, I’m going back to Brazil,” male non-native speaker, 
“ndencomp” implementation; (a) Original time domain plot; (b) Spectrogram of 
original speech; (c) Time domain plot after denoising/compression; (d) Spectrogram 
after denoising/compression (both spectrograms use a Hanning time window of length 
256 samples and overlapping of 128 samples between adjacent windows, fs = 8 kHz) 

Figure 8.8 Word “Project,” male non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Time domain plot 
after denoising/compression; (d) Spectrogram after denoising/compression; (e) Time 
domain plot after decoding, 16-level quantizer; (f) Spectrogram after decoding, 16-level 
quantizer (both spectrograms use a Hanning time window of length 256 samples and 
overlapping of 128 samples between adjacent windows, fs = 8 kHz) 

Figure 8.9 Sentence “Be nice to your sister,” female native speaker, “ndencomp” 
implementation; (a) Original time domain plot; (b) Spectrogram of original speech; 
(c) Time domain plot after denoising/compression; (d) Spectrogram after 
denoising/compression; (e) Time domain plot after decoding, 16-level quantizer; 
(f) Spectrogram after decoding, 16-level quantizer (both spectrograms use a Hanning time 
window of length 256 samples and overlapping of 128 samples between adjacent windows, 
fs = 8 kHz) 

Figure 8.10 Sentence “Be nice to your sister,” female native speaker, “ndencomp” 
implementation; (a) Original time domain plot; (b) Spectrogram of original speech; 
(c) Time domain plot after denoising/compression; (d) Spectrogram after 
denoising/compression; (e) Time domain plot after decoding, 32-level quantizer; 
(f) Spectrogram after decoding, 32-level quantizer (both spectrograms use a Hanning time 
window of length 256 samples and overlapping of 128 samples between adjacent windows, 
fs = 8 kHz) 

Figure 8.11 Sentence “Be nice to your sister,” female native speaker, “ndencomp” 
implementation; (a) Original time domain plot; (b) Spectrogram of original speech; 
(c) Time domain plot after denoising/compression; (d) Spectrogram after 
denoising/compression; (e) After decoding, 64-level quantizer; (f) After decoding, 64-level 
quantizer (both spectrograms use a Hanning time window of length 256 samples and 
overlapping of 128 samples between adjacent windows, fs = 8 kHz) 

Figure 8.12 Sentence “Bye, guys, I’m going back to Brazil,” male non-native speaker, 
“ndencomp” implementation; (a) Original time domain plot; (b) Spectrogram of original 
speech; (c) Plot after denoising/compression; (d) Spectrogram after 
denoising/compression; (e) Time domain plot after decoding, 16-level quantizer; 
(f) Spectrogram after decoding, 16-level quantizer (both spectrograms use a Hanning time 
window of length 256 samples and overlapping of 128 samples between adjacent windows, 
fs = 8 kHz) 

Figure 8.13 Sentence “Bye, guys, I’m going back to Brazil,” male non-native speaker, 
“ndencomp” implementation; (a) Original time domain plot; (b) Spectrogram of original 
speech; (c) Plot after denoising/compression; (d) Spectrogram after 
denoising/compression; (e) Time domain plot after decoding, 32-level quantizer; 
(f) Spectrogram after decoding, 32-level quantizer (both spectrograms use a Hanning time 
window of length 256 samples and overlapping of 128 samples between adjacent windows, 
fs = 8 kHz) 

Figure 8.14 Word “Hey,” female non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Plot after 
denoising/compression; (d) Spectrogram after denoising/compression; (e) Time domain 
plot after decoding, 16-level quantizer; (f) Spectrogram after decoding, 16-level quantizer 
(both spectrograms use a Hanning time window of length 256 samples and overlapping of 
128 samples between adjacent windows, fs = 8 kHz) 

Figure 8.15 Word “Hey,” female non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Plot after 
denoising/compression; (d) Spectrogram after denoising/compression; (e) After decoding, 
32-level quantizer; (f) After decoding, 32-level quantizer (both spectrograms use a 
Hanning time window of length 256 samples and overlapping of 128 samples 
between adjacent windows, fs = 8 kHz) 

Figure 8.16 Word “Hey,” female non-native speaker, “ndencomp” implementation; 
(a) Original time domain plot; (b) Spectrogram of original speech; (c) Plot after 
denoising/compression; (d) Spectrogram after denoising/compression; (e) After decoding, 
64-level quantizer; (f) After decoding, 64-level quantizer (both spectrograms use a 
Hanning time window of length 256 samples and overlapping of 128 samples between 
adjacent windows, fs = 8 kHz) 

Figure 8.17 Sentence “Be nice to your sister,” female native speaker, compressed 
with the CPT, “ndencomp” implementation; (a) Original time domain plot; 
(b) Spectrogram of original speech; (c) Plot after denoising/compression with 0.85% 
non-zero coefficients selected; (d) Spectrogram after denoising/compression (both 
spectrograms use a Hanning time window of length 256 samples and overlapping of 128 
samples between adjacent windows, fs = 8 kHz) 

Figure 8.18 Sentence “Be nice to your sister,” female native speaker, compressed 
with the WPT, using a “Daubechies” basis function; (a) Original time domain plot; 
(b) Spectrogram of original speech; (c) Plot after denoising/compression with 15% 
non-zero coefficients selected; (d) Spectrogram after compression (both spectrograms use a 
Hanning time window of length 256 samples and overlapping of 128 samples between 
adjacent windows, fs = 8 kHz) 

IX. CONCLUSION 


In this thesis, compression schemes based on the Cosine Packet Transform using 
the Local Cosine Transform are presented. The basis functions are chosen via the Best 
Basis Algorithm using the entropy minimization criterion. 

Coefficients for compression are chosen with an adaptive scheme, which selects 
more cosine packet coefficients for voiced intervals than for unvoiced ones. In addition, 
since some recorded speech sounds have equipment noise, a denoising scheme is 
performed. 

Finally, an encoding scheme is implemented. Thus, this study simulates the entire 
process of denoising, compression, and encoding (on the transmitter side), as well as 
decoding and reconstruction (on the receiver side). 

The results obtained are good, due to the combination of certain factors, which 
include the following: 

(a) Good time and frequency resolution of the local cosine transform; 

(b) The Cosine Packet Transform, combined with the Best Basis algorithm using 
the entropy minimization criterion allowed not only for minimizing the entropy, but also 
for the splitting of the signal into its locally stationary portions. These two factors greatly 
contribute to the success of the compression scheme; 

(c) The Adaptive Thresholding scheme helps to optimize in quality and quantity 
the number of cosine packet coefficients, while preserving good compressed signal 
properties; 

(d) The denoising scheme allows the number of non-zero coefficients to be 
reduced and, at the same time, a better quality of denoised sound when compared to the 
original noisy speech. 
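
The entropy minimization criterion mentioned in factor (b) can be sketched with the usual additive entropy-type cost of best-basis selection. The Python code below is a minimal illustration under that assumption, not the WaveLab routine used in this work; the function names and coefficient values are hypothetical.

```python
import math

def entropy_cost(coeffs, total_energy):
    """Additive entropy-type cost: -sum p*log(p), with p = c^2 / total_energy."""
    cost = 0.0
    for c in coeffs:
        p = c * c / total_energy
        if p > 0.0:
            cost -= p * math.log(p)
    return cost

def should_split(parent, left, right, total_energy):
    """Best-basis step: split an interval when its two children are cheaper."""
    children = entropy_cost(left, total_energy) + entropy_cost(right, total_energy)
    return children < entropy_cost(parent, total_energy)

# Toy unit-energy example: energy spread over the parent interval,
# but concentrated in a single coefficient after splitting.
parent = [0.5, 0.5, 0.5, 0.5]
left, right = [1.0, 0.0], [0.0, 0.0]
print(entropy_cost(parent, 1.0), should_split(parent, left, right, 1.0))
```

A representation that concentrates the energy in few coefficients has a low cost, which is why minimizing it favors bases that compress well.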

Through the denoising attempts, it is possible to recognize some patterns of 
speech that would be hidden by the higher energy noise in regular compression. The 
frequency analysis allows speech sounds to be differentiated from background noise and 
hence permits recovering most of the speech sounds. 

Basically, two main problems remain. First, there are a few sounds with low 
energy that need to be correctly identified and recovered from the background noise. In 
the experiments with noisy speech, the only cases that could not be solved are the weak 
unvoiced endings, like /t/ at the end of the word “met” (which is reconstructed like a /d/) 
and /s/ at the end of the words “cats” and “lets,” which is lost due to the denoising process. 
Although many phonemes were tried, there are probably others that could have been 
attempted; this is a suggestion for further study. The second problem is quantization 
noise. Although the encoding scheme works well enough that, in many cases, the speech 
at the simulated receiver is recognizable and sounds “cleaner” than the original noisy 
signal, noise is introduced by the quantization process. Although very small, this noise is 
enough to cancel endings like /kt/ in the word “project.” Since this research focused on 
the compression schemes, less effort was made to develop better quantizing and encoding 
schemes (another point for further study). 

The CPT performs better than the WPT for speech compression applications. 
When using the WPT, the compression scheme begins losing low energy sounds much 
earlier than the CPT, i.e., with a much lower compression ratio, although this may be due 
to the basis function that was selected. 

The purpose of this study is to find an optimal scheme for the compression of 
speech signals. Since the scheme used in this study is successful, speech samples are 
tested with the highest possible compression ratios. The reconstruction quality obtained in 
the majority of trials can be considered “fair” (see Table 8.1), as shown by the average 
mean grades assigned. The very small percentages of selected coefficients in the 
compression result tables and the very high compression ratios in the encoding 
results, together with a “fair” quality reconstruction, indicate a positive overall result. The 
compression ratios are not fixed, since the scheme adapts to the speech being 
analyzed. However, our results indicate an average compression ratio of 1:50 on the 
speech used in our study. The ratio can be adjusted for better quality of reconstructed 
sound, according to the needs of the user and the resources available. Evidently, there will 
always be a need to compromise between the compression ratio and the quality of the 
reconstructed speech. 

APPENDIX. COMPUTER CODE 


% Name: Compcp.m and necompcp.m 

% Subject: Analysis, Compression and Synthesis routine of speech data 
% Description: 

% These two routines contain the following main parts: 

% a) Input and loading of speech to be used ( prompts the user for choices like gender of 
% speaker, word or sentences among those available and finest depth for time splitting); 

% b) Implements the Cosine Packet Transform (CPT) of the speech sequence; 

% c) Chooses the basis for the CPT by applying the Best Basis Algorithm; 

% d) Implements a Frequency Behavior and an Energy Behavior plot; 

% e) Implements a voiced-unvoiced segmentation; 

% f) Selects the coefficients by applying the Adaptive Thresholding scheme; 

% g) Applies the inverse CPT, by transforming each interval, unfolding and adding it to the 
% existing sequence; 

% h) Computes and presents the number of non-zero coefficients before and after the 
% compression scheme as well as the mean square error between the original and the 
% reconstructed sequences; 

% i) Presents plots containing the Frequency as well as the Energy behavior; also presents the 
% voiced-unvoiced segmentation plot as well as time domain and spectrogram plots of both 
% original and reconstructed sequences; 

% Notel: Parts b), c) and g) are extracted fi-om the software package Wavelab.600, Stanford 
% University[17]. This is also valid for the programs encp6.m, ndencomp.m, 

% encptour.m and ndentour.m; 

% Note 2: WaveLab code was modified to implement our compression schemes. 

% Written and adapted by J. Roberto V. Martins, in October 1995. 
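
% Illustrative sketch (not part of the original routines): the listings in this appendix call
% a helper, comp(coefficients, percentage), whose source is not reproduced here. Assuming it
% discards the given percentage of smallest-magnitude coefficients, a minimal stand-in could
% look as follows. The name comp_sketch and its rounding rule are assumptions.

```matlab
function y = comp_sketch(x, pct)
% COMP_SKETCH  Zero the pct percent smallest-magnitude entries of x.
%   Hypothetical stand-in for the helper "comp" called in the listings;
%   the behavior of the real "comp" is assumed, not confirmed.
n = length(x);
nzero = floor(n * pct / 100);      % number of entries to discard
[dummy, order] = sort(abs(x));     % indices in ascending order of magnitude
y = x;
y(order(1:nzero)) = 0;             % only the largest entries survive
```

% Under this assumption, comp_sketch(c, 99.5) on a 1024-point interval would keep the
% 6 largest coefficients and zero the remaining 1018.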

% Compcp.m

% Input and loading of speech to be used
clear;

V = input('Please enter "1" for female voice and "2" for a male voice :');
if V == 1
  P = 2;
  FV = input(['Please enter 1 for the sentence, 2 for "be", 3 for "hate", 4 for "hey", 5 for "met", 6 for ' ...
    '"pay", 7 for "cats", 8 for "benice": ']);
  if FV == 1
    clear ny; load fse;
    ny = [fse' zeros(1,5120)];
  elseif FV == 2
    clear ny; load fbe;
    ny = [fbe' zeros(1,2048)];
  elseif FV == 3
    clear ny; load fha;
    ny = [fha' zeros(1,7168)];
  elseif FV == 4
    clear ny; load fhey;
    ny = [fhey'];
  elseif FV == 5
    clear ny; load fmet;
    ny = [fmet' zeros(1,1024)];
  elseif FV == 6
    clear ny; load fpay;
    ny = [fpay' zeros(1,1024)];
  elseif FV == 7
    clear ny; load fcats;
    ny = [fcats'];
  elseif FV == 8
    clear ny;
    benice = loadwav('benice.wav');
    ny = [(benice(1:16384)/max(abs(benice))+0.0119)'];
  end
end

if V == 2
  P = 2;
  W = input(['Please enter 1 for "project", 2 for "cataratas", 3 for "encyclopedia", 4 for "issos", 5 for "assos", 6 ' ...
    'for "six", 7 for "the sentence", 8 for "aka", 9 for "at", 10 for "azure", 11 for "be", 12 for "bird", 13 for "boot", 14 ' ...
    'for "call", 15 for "day", 16 for "eka", 17 for "epa", 18 for "eve", 19 for "father", 20 for "foot", 21 for "for", 22 ' ...
    'for "go", 23 for "hate", 24 for "he", 25 for "ika", 26 for "it", 27 for "key", 28 for "let", 29 for "me", 30 for ' ...
    '"met", 31 for "no", 32 for "obey", 33 for "opa", 34 for "pay", 35 for "read", 36 for "see", 37 for "she", 38 for ' ...
    '"then", 39 for "thin", 40 for "to", 41 for "up", 43 for "vote", 44 for "we", 45 for "you", 46 for "zoo", 47 for ' ...
    '"silence", 48 for "the bye sentence", 49 for "beback", 50 for "blows", 51 for "bruna", 52 for "adams", 53 for ' ...
    '"sounds good" : ']);

  if W == 1
    clear ny; load newvoice;
    ny = y(2700:2700+8191)';
  elseif W == 2
    clear ny; load catar;
    ny = ca(1900:1900+8191)';
  elseif W == 3
    clear ny; load encic;
    ny = en(1200:1200+8191)';
  elseif W == 4
    clear ny; load issos;
    ny = is(1900:1900+8191)';
  elseif W == 5
    clear ny; load assos;
    ny = as(1900:1900+8191)';
  elseif W == 6
    clear ny; load six;
    ny = si(1:8192)';
  elseif W == 7
    clear ny; load myvoice;
    ny = x(9000:9000+32767)';
  elseif W == 8
    clear ny; load aka;
    ny = (ac+0.1656)';
  elseif W == 9
    clear ny; load at;
    ny = (at+0.1655)';
  elseif W == 10
    clear ny; load azure;
    ny = [(az+0.1651)' zeros(1,6144)];
  elseif W == 11
    clear ny; load be;
    ny = [(be+0.1654)' zeros(1,3072)];
  elseif W == 12
    clear ny; load bird;
    ny = [(bi+0.1658)' zeros(1,7168)];
  elseif W == 13
    clear ny; load boot;
    ny = [(bo+0.1652)'];
  elseif W == 14
    clear ny; load call;
    ny = [(cal+0.1654)' zeros(1,6144)];
  elseif W == 15
    clear ny; load day;
    ny = [(da+0.1645)' zeros(1,1024)];
  elseif W == 16
    clear ny; load eka;
    ny = [(ek+0.1653)'];
  elseif W == 17
    clear ny; load epa;
    ny = [(ep+0.1650)' zeros(1,6144)];
  elseif W == 18
    clear ny; load eve;
    ny = [(ev+0.1654)' zeros(1,4096)];
  elseif W == 19
    clear ny; load father;
    ny = [(fa+0.1648)' zeros(1,6144)];
  elseif W == 20
    clear ny; load foot;
    ny = [(foo+0.1653)' zeros(1,6144)];
  elseif W == 21
    clear ny; load for;
    ny = [(fo+0.1649)' zeros(1,6144)];
  elseif W == 22
    clear ny; load go;
    ny = [(go+0.1651)'];
  elseif W == 23
    clear ny; load hate;
    ny = [(ha+0.1657)' zeros(1,7168)];
  elseif W == 24
    clear ny; load he;
    ny = [(he+0.1657)' zeros(1,2048)];
  elseif W == 25
    clear ny; load ika;
    ny = [(ik+0.1654)' zeros(1,6144)];
  elseif W == 26
    clear ny; load it;
    ny = [(it+0.1657)' zeros(1,3072)];
  elseif W == 27
    clear ny; load key;
    ny = [(ke+0.1652)' zeros(1,2048)];
  elseif W == 28
    clear ny; load let;
    ny = [(le+0.1657)'];
  elseif W == 30
    clear ny; load met;
    ny = [(met+0.1653)'];
  elseif W == 31
    clear ny; load no;
    ny = [(no+0.1646)' zeros(1,1024)];
  elseif W == 34
    clear ny; load pay;
    ny = [(pa+0.1655)' zeros(1,1024)];
  elseif W == 36
    load see;
    ny = [(se+0.1653)' zeros(1,1024)];
  elseif W == 37
    load she;
    ny = [(sh+0.1654)'];
  elseif W == 38
    load then;
    ny = [(th+0.1656)' zeros(1,6144)];
  elseif W == 39
    load thin;
    ny = [(thi+0.1655)' zeros(1,1024)];
  elseif W == 40
    load to;
    ny = [(to+0.1649)' zeros(1,3072)];
  elseif W == 41
    load up;
    ny = [(up+0.1653)' zeros(1,2048)];
  elseif W == 43
    load vote;
    ny = [(vo+0.1654)' zeros(1,1024)];
  elseif W == 44
    clear ny; load we;
    ny = [(we+0.1655)' zeros(1,2048)];
  elseif W == 45
    clear ny; load you;
    ny = [(you+0.1655)' zeros(1,2048)];
  elseif W == 46
    clear ny; load zoo;
    ny = [(zo+0.1646)' zeros(1,2048)];
  elseif W == 47
    clear ny; load myvoice;
    ny = x(1:8192)';
  elseif W == 48
    clear ny; load bye;
    ny = [bye' zeros(1,9216)];
  elseif W == 49
    clear ny;
    beback = loadwav('beback.wav');
    ny = [(beback/max(abs(beback))+0.056)' zeros(1,4824)];
  elseif W == 50
    clear ny;
    blows = loadwav('blows.wav');
    ny = [(blows(1:16384)/max(abs(blows))+0.0034)'];
  elseif W == 51
    clear ny;
    br = loadwav('bruna.wav');
    ny = [(br/max(abs(br))+0.0155)' zeros(1,7268)];
  elseif W == 52
    clear ny;
    adam = loadwav('adamsfam.wav');
    ny = [(adam(1:32768)/max(abs(adam))+0.0081)'];
  elseif W == 53
    clear ny; load engl6;
    ny = [(engl6(1:16384)+3.019e-4)'];
  end
end

n = length(ny)

D = input('Enter the finest depth for Time Splitting :');

% Implementing the Cosine Packet Transform
cp = CPAnalysis(ny,D,'Sine');
stree = CalcStatTree(cp,'Entropy');
[btree,vtree] = BestBasis(stree,D);
[n,L] = size(cp);

% Create Bell
bellname = 'Sine';
m = n / 2^D / 2;
[bp,bm] = MakeONBell(bellname,m);
x = zeros(1,n);

% initialize tree traversal stack
stack = zeros(2,2^D+1);
tp = zeros(1,n);
v = zeros(1,n);
compr = zeros(1,n);
coef = zeros(1,n);
ncoef = zeros(1,n);
k = 1;
stack(:,k) = [0 0]';
v = zeros(1,n);
vs = zeros(1,n);
ind = 0;
le = zeros(1,2^D);
while (k > 0),
  d = stack(1,k); b = stack(2,k); k = k-1;
  if (btree(node(d,b)) ~= 0), % nonterminal node
    k = k+1; stack(:,k) = [(d+1) (2*b)]';
    k = k+1; stack(:,k) = [(d+1) (2*b+1)]';
  else
    c = cp(packet(d,b,n),d+1)';
    coef(1,b/(2^d).*n+1:(b+1)/(2^d).*n) = c;
    i = (b/(2^d)*n+1);
    len = length(c);
    [I,ND] = max(abs(c));
    compr(1,b/(2^d)*n+1) = length(c);

    % Identifying the Frequency Content of each interval
    if ND <= round(len/16)
      v(i) = 0.25; % it was 0.2
    elseif ND <= round(len/8)
      v(i) = 0.5; % it was 0.4
    elseif ND < length(c)/(2*P)
      v(i) = 1; % it was 0.6
    else
      [sI,sND] = max(abs([coef(i:i+ND-3),0,0,0,coef(i+ND+1:i+len-1)]));
      if sND >= length(c)/(2*P)
        if ND <= len/2
          v(i) = 1.5; % it was 0.75
        elseif ND <= len*3/4
          v(i) = 2; % it was 0.9
        else
          v(i) = 2.5; % it was 1.0
        end
      elseif sND > round(len/8)
        v(i) = 1; % it was 0.6
      elseif sND > round(len/16)
        v(i) = 0.5; % it was 0.4
      else
        v(i) = 0.25; % it was 0.2
      end
    end

    ec(i:i+len-1) = ones(1,len) .* sum(c.^2); % computing the energy of the coefficients
    es(i:i+len-1) = ones(1,len) .* sum(ny(i:i+len-1).^2); % computing the energy of the intervals
    van = std(c);
    tp(1,b/(2.^d).*n+1) = 1;
    len = length(c);
    ind = ind + 1;
    le(ind) = log2(len);
    v(i);
    rko = length(c)/16;
    ko = ND;
    fo = 4000/length(c)*ND;
    toten = sum(coef.^2);
    i = (b/(2^d)*n+1);

    % Applying the Adaptive Thresholding Compression Scheme
    if v(i) <= 0.5
      if sum(coef(i:compr(i)-1+i).^2) < toten/n * len
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        end
      else
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),98.7);
        else
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),97.66);
        end
      end
      nc = nncoef(i:compr(i)-1+i);
    end

    if v(i) > 0.5
      sumco = sum(coef(i:compr(i)-1+i).^2);
      thres = 0.5*toten/n * len;
      if sum(coef(i:compr(i)-1+i).^2) < toten/n * len;
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:len+i-1) = comp(coef(i:len+i-1),99.5);
        end
      else
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:compr(i)-1+i) = comp(coef(i:compr(i)-1+i),99.5);
        end
      end
      nc = nncoef(i:compr(i)-1+i);
    end

    % Voiced-unvoiced segmentation
    if v(i) > 1
      vs(i) = 1;
    else
      if es(i:i+len-1) > (toten/n *2.5* len)
        vs(i) = 0.5;
      end
    end

    y = dct_iv(nc); % Inverse Transforming each interval

    % Unfolding each interval and Reconstructing the time sequence after compression
    [xc,xl,xr] = unfold(y,bp,bm);
    x(packet(d,b,n)) = x(packet(d,b,n)) + xc;
    if b > 0,
      x(packet(d,b-1,n)) = x(packet(d,b-1,n)) + xl;
    else
      x(packet(d,0,n)) = x(packet(d,0,n)) + edgeunfold('left',xc,bp,bm);
    end
    if b < 2^d-1,
      x(packet(d,b+1,n)) = x(packet(d,b+1,n)) + xr;
    else
      x(packet(d,b,n)) = x(packet(d,b,n)) + edgeunfold('right',xc,bp,bm);
    end
  end
end

nind = sum(le>0);
nle = le(1:nind);
figure(1),plot(ny), hold;
plot(tp,':'),hold off;
figure(2),plot(ny),hold
plot(v,':'),hold off;

mse = mean((ny - x).^2) % computing the mean square error between the original and
% the reconstructed sequence;

scoefinO = sum(abs(coef)>0) % computing the number of non-zero coefficients before
% compression

sncoefinO = sum(abs(nncoef)>0) % computing the number of non-zero coefficients after
% compression

figure(3),
plot(x);
figure(4),
plot(ec);
figure(5),
plot(es);
figure(6),specgram(ny,[],1)
title('Observing the Coarticulation for the sound "ISSOS"')
print figure6 -depsc

figure(7),
subplot(3,1,1),plot(ny)
title('Speech Signal: "ISSOS"')
subplot(3,1,2),plot(ny), hold;
plot(tp,':'),hold off;title('Time Partition')
subplot(3,1,3),plot(ny),hold
plot(v,':'),hold off;title('Frequency Behavior')

figure(8),
subplot(3,1,1),plot(ny,'b'),
%plot(v,':'),hold off;
title('"Be nice to your sister"')
subplot(3,1,2),plot(vs,'b')
title('Voiced-Unvoiced Segmentation')
subplot(3,1,3),
specgram(ny,[],1)
title('Observing The Spectrogram for "Be nice to your sister"')
print figure8 -depsc
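
% Illustrative sketch (not part of the original routines): Compcp.m reports the number of
% non-zero coefficients before and after compression and the mean square error. Assuming the
% same four vectors are available, the percentage of retained coefficients follows directly.
% The function name retained_sketch is hypothetical.

```matlab
function [pct, mse] = retained_sketch(coef, nncoef, ny, x)
% RETAINED_SKETCH  Percentage of retained coefficients and reconstruction MSE.
%   coef   - transform coefficients before compression
%   nncoef - coefficients after thresholding
%   ny, x  - original and reconstructed sequences (same length)
before = sum(abs(coef) > 0);       % non-zero coefficients before compression
after = sum(abs(nncoef) > 0);      % non-zero coefficients after compression
pct = 100 * after / before;        % percentage of coefficients kept
mse = mean((ny - x).^2);           % mean square reconstruction error
```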


% Necompcp.m

% Input and loading of speech to be used
clear;

V = input('Please enter "1" for female voice and "2" for a male voice :');
if V == 1
  P = 2;
  FV = input(['Please enter 1 for the sentence, 2 for "be", 3 for "hate", 4 for "hey", 5 for "met", 6 for ' ...
    '"pay", 7 for "cats", 8 for "benice" :']);
  if FV == 1
    clear ny; load fse;
    ny = [fse' zeros(1,5120)];
  elseif FV == 2
    clear ny; load fbe;
    ny = [fbe' zeros(1,2048)];
  elseif FV == 3
    clear ny; load fha;
    ny = [fha' zeros(1,7168)];
  elseif FV == 4
    clear ny; load fhey;
    ny = [fhey'];
  elseif FV == 5
    clear ny; load fmet;
    ny = [fmet' zeros(1,1024)];
  elseif FV == 6
    clear ny; load fpay;
    ny = [fpay' zeros(1,1024)];
  elseif FV == 7
    clear ny; load fcats;
    ny = [fcats'];
  elseif FV == 8
    clear ny;
    benice = loadwav('benice.wav');
    ny = [(benice(1:16384)/max(abs(benice))+0.0119)'];
  end
end


if V == 2
  P = 2;
  W = input(['Please enter 1 for "project", 2 for "cataratas", 3 for "encyclopedia", 4 for "issos", 5 for "assos", 6 ' ...
    'for "six", 7 for "the sentence", 8 for "aka", 9 for "at", 10 for "azure", 11 for "be", 12 for "bird", 13 for "boot", 14 ' ...
    'for "call", 15 for "day", 16 for "eka", 17 for "epa", 18 for "eve", 19 for "father", 20 for "foot", 21 for "for", 22 ' ...
    'for "go", 23 for "hate", 24 for "he", 25 for "ika", 26 for "it", 27 for "key", 28 for "let", 29 for "me", 30 for ' ...
    '"met", 31 for "no", 32 for "obey", 33 for "opa", 34 for "pay", 35 for "read", 36 for "see", 37 for "she", 38 for ' ...
    '"then", 39 for "thin", 40 for "to", 41 for "up", 43 for "vote", 44 for "we", 45 for "you", 46 for "zoo", 47 for ' ...
    '"silence", 48 for "the bye sentence", 49 for "beback", 50 for "blows", 51 for "bruna", 52 for "adams" :']);

  if W == 1
    clear ny; load newvoice;
    ny = y(2700:2700+8191)';
  elseif W == 2
    clear ny; load catar;
    ny = ca(1900:1900+8191)';
  elseif W == 3
    clear ny; load encic;
    ny = en(1200:1200+8191)';
  elseif W == 4
    clear ny; load issos;
    ny = is(1900:1900+8191)';
  elseif W == 5
    clear ny; load assos;
    ny = as(1900:1900+8191)';
  elseif W == 6
    clear ny; load six;
    ny = si(1:8192)';
  elseif W == 7
    clear ny; load myvoice;
    ny = x(9000:9000+32767)';
  elseif W == 8
    clear ny; load aka;
    ny = (ac+0.1656)';
  elseif W == 9
    clear ny; load at;
    ny = (at+0.1655)';
  elseif W == 10
    clear ny; load azure;
    ny = [(az+0.1651)' zeros(1,6144)];
  elseif W == 11
    clear ny; load be;
    ny = [(be+0.1654)' zeros(1,3072)];
  elseif W == 12
    clear ny; load bird;
    ny = [(bi+0.1658)' zeros(1,7168)];
  elseif W == 13
    clear ny; load boot;
    ny = [(bo+0.1652)'];
  elseif W == 14
    clear ny; load call;
    ny = [(cal+0.1654)' zeros(1,6144)];
  elseif W == 15
    clear ny; load day;
    ny = [(da+0.1645)' zeros(1,1024)];
  elseif W == 16
    clear ny; load eka;
    ny = [(ek+0.1653)'];
  elseif W == 17
    clear ny; load epa;
    ny = [(ep+0.1650)' zeros(1,6144)];
  elseif W == 18
    clear ny; load eve;
    ny = [(ev+0.1654)' zeros(1,4096)];
  elseif W == 19
    clear ny; load father;
    ny = [(fa+0.1648)' zeros(1,6144)];
  elseif W == 20
    clear ny; load foot;
    ny = [(foo+0.1653)' zeros(1,6144)];
  elseif W == 21
    clear ny; load for;
    ny = [(fo+0.1649)' zeros(1,6144)];
  elseif W == 22
    clear ny; load go;
    ny = [(go+0.1651)'];
  elseif W == 23
    clear ny; load hate;
    ny = [(ha+0.1657)' zeros(1,7168)];
  elseif W == 24
    clear ny; load he;
    ny = [(he+0.1657)' zeros(1,2048)];
  elseif W == 25
    clear ny; load ika;
    ny = [(ik+0.1654)' zeros(1,6144)];
  elseif W == 26
    clear ny; load it;
    ny = [(it+0.1657)' zeros(1,3072)];
  elseif W == 27
    clear ny; load key;
    ny = [(ke+0.1652)' zeros(1,2048)];
  elseif W == 28
    clear ny; load let;
    ny = [(le+0.1657)'];
  elseif W == 30
    clear ny; load met;
    ny = [(met+0.1653)'];
  elseif W == 31
    clear ny; load no;
    ny = [(no+0.1646)' zeros(1,1024)];
  elseif W == 34
    clear ny; load pay;
    ny = [(pa+0.1655)' zeros(1,1024)];
  elseif W == 36
    load see;
    ny = [(se+0.1653)' zeros(1,1024)];
  elseif W == 37
    load she;
    ny = [(sh+0.1654)'];
  elseif W == 38
    load then;
    ny = [(th+0.1656)' zeros(1,6144)];
  elseif W == 39
    load thin;
    ny = [(thi+0.1655)' zeros(1,1024)];
  elseif W == 40
    load to;
    ny = [(to+0.1649)' zeros(1,3072)];
  elseif W == 41
    load up;
    ny = [(up+0.1653)' zeros(1,2048)];
  elseif W == 43
    load vote;
    ny = [(vo+0.1654)' zeros(1,1024)];
  elseif W == 44
    clear ny; load we;
    ny = [(we+0.1655)' zeros(1,2048)];
  elseif W == 45
    clear ny; load you;
    ny = [(you+0.1655)' zeros(1,2048)];
  elseif W == 46
    clear ny; load zoo;
    ny = [(zo+0.1646)' zeros(1,2048)];
  elseif W == 47
    clear ny; load myvoice;
    ny = x(1:8192)';
  elseif W == 48
    clear ny; load bye;
    ny = [bye' zeros(1,9216)];
  elseif W == 49
    clear ny;
    beback = loadwav('beback.wav');
    ny = [(beback/max(abs(beback))+0.056)' zeros(1,4824)];
  elseif W == 50
    clear ny;
    blows = loadwav('blows.wav');
    ny = [(blows(1:16384)/max(abs(blows))+0.0034)'];
  elseif W == 51
    clear ny;
    br = loadwav('bruna.wav');
    ny = [(br/max(abs(br))+0.0155)' zeros(1,7268)];
  elseif W == 52
    clear ny;
    adam = loadwav('adamsfam.wav');
    ny = [(adam(1:32768)/max(abs(adam))+0.0081)'];
  end
end

n = length(ny)

D = input('Enter the finest depth for Time Splitting :');

% Implementing the Cosine Packet transform
cp = CPAnalysis(ny,D,'Sine');
stree = CalcStatTree(cp,'Entropy');
[btree,vtree] = BestBasis(stree,D); % Choosing the basis by applying the Best Basis Algorithm
[n,L] = size(cp);

% Create Bell
bellname = 'Sine';
m = n/2^D/2;
[bp,bm] = MakeONBell(bellname,m);
x = zeros(1,n);

% initialize tree traversal stack
stack = zeros(2,2^D+1);
tp = zeros(1,n);
v = zeros(1,n);
compr = zeros(1,n);
coef = zeros(1,n);
ncoef = zeros(1,n);
k = 1;
stack(:,k) = [0 0]';
v = zeros(1,n);
ind = 0;
le = zeros(1,2^D);
while (k > 0),
  d = stack(1,k); b = stack(2,k); k = k-1;
  if (btree(node(d,b)) ~= 0), % nonterminal node
    k = k+1; stack(:,k) = [(d+1) (2*b)]';
    k = k+1; stack(:,k) = [(d+1) (2*b+1)]';
  else
    c = cp(packet(d,b,n),d+1)';
    coef(1,b/(2^d).*n+1:(b+1)/(2^d).*n) = c;
    i = (b/(2^d)*n+1);
    len = length(c);
    [I,ND] = max(abs(c));
    compr(1,b/(2^d)*n+1) = length(c);

    % Identifying the Frequency content
    if ND <= round(len/16)
      v(i) = 0.4;
    elseif ND <= round(len/8)
      v(i) = 0.6;
    elseif ND <= round(len/(16/3))
      v(i) = 0.7;
    elseif ND <= length(c)/(2*P)
      v(i) = 0.8;
      [sI,sND] = max(abs([coef(i:i+ND-3),0,0,0,coef(i+ND+1:i+len-1)]));
      if sND >= len/(2*P)
        v(i) = 1;
      elseif sND > round(len/(16/3))
        v(i) = 0.8;
      elseif sND > round(len/8)
        v(i) = 0.7;
      elseif sND > round(len/16)
        v(i) = 0.6;
      else
        v(i) = 0.4;
      end
    end

    ec = sum(c.^2); % Computing the Energy of the Coefficients
    van = std(c);
    tp(1,b/(2.^d).*n+1) = 1;
    len = length(c);
    ind = ind + 1;
    le(ind) = log2(len);
    v(i);
    rko = length(c)/16;
    ko = ND;
    fo = 4000/length(c)*ND;
    toten = sum(coef.^2);
    i = (b/(2^d)*n+1);

    % Applying the Adaptive Thresholding Compression Scheme
    if v(i) <= 0.6
      if sum(coef(i:compr(i)-1+i).^2) < toten/n * len
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        end
      else
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),98.7); %98.7
        end
      end
      nc = nncoef(i:compr(i)-1+i);
    end

    if v(i) > 0.6
      sumco = sum(coef(i:compr(i)-1+i).^2);
      thres = 0.5*toten/n * len;
      if sum(coef(i:compr(i)-1+i).^2) < toten/n * len; % it was 0.5*toten/n*len
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5); %0.91% for all in "/be nice/"
        else
          nncoef(i:len+i-1) = comp(coef(i:len+i-1),99.5);
        end
      else
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((coef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:compr(i)-1+i) = comp(coef(i:compr(i)-1+i),99.5);
        end
      end
      nc = nncoef(i:compr(i)-1+i);
    end


    y = dct_iv(nc); % Inverse transforming each interval

    % Unfolding each interval and Reconstructing the time sequence after compression
    [xc,xl,xr] = unfold(y,bp,bm);
    x(packet(d,b,n)) = x(packet(d,b,n)) + xc;
    if b > 0,
      x(packet(d,b-1,n)) = x(packet(d,b-1,n)) + xl;
    else
      x(packet(d,0,n)) = x(packet(d,0,n)) + edgeunfold('left',xc,bp,bm);
    end
    if b < 2^d-1,
      x(packet(d,b+1,n)) = x(packet(d,b+1,n)) + xr;
    else
      x(packet(d,b,n)) = x(packet(d,b,n)) + edgeunfold('right',xc,bp,bm);
    end
  end
end

nind = sum(le>0);
nle = le(1:nind);
figure(1),plot(ny), hold;
plot(tp,':'),hold off;
print -deps figure1
figure(2),plot(ny),hold
plot(v,':'),hold off;
print -deps figure2

mse = mean((ny - x).^2) % computing the mean square error between the original and
% the reconstructed sequence;

scoefinO = sum(abs(coef)>0) % computing the number of non-zero coefficients before
% compression

sncoefinO = sum(abs(nncoef)>0) % computing the number of non-zero coefficients after
% compression

figure(3),
plot(x);

figure(4)
subplot(2,2,1),plot(ny,'b')
title('a) "ISSOS", ORIGINAL PLOT')
subplot(2,2,2),plot(x,'b')
title('c) AFTER FIXED THRESHOLDING (0.78%)')
subplot(2,2,3),specgram(ny,[],1)
title('b) SPECTROGRAM')
subplot(2,2,4),specgram(x,[],1)
title('d) SPECTROGRAM (AFTER)')
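
% Illustrative sketch (not part of the original routines): the denoising steps in encp6.m and
% ndencomp.m repeatedly build vectors of the form [zeros(1,len/16), coef(i+len/16:i+len-1)],
% that is, they zero the lowest-frequency coefficients of an interval and keep the rest.
% A minimal helper doing the same for a single interval is shown below; the name
% lowcut_sketch and the use of round() are assumptions.

```matlab
function nc = lowcut_sketch(c, frac)
% LOWCUT_SKETCH  Zero the first round(frac*length(c)) coefficients of c.
%   c    - row vector of transform coefficients for one interval
%   frac - fraction of the interval to discard, e.g. 1/16
len = length(c);
ncut = round(len * frac);             % number of low-frequency bins removed
nc = [zeros(1,ncut), c(ncut+1:len)];  % output has the same length as c
```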


% Name: encp6.m and ndencomp.m
% Subject: Analysis, Denoising/Compression and Synthesis of Speech data;
% Description: These two routines contain the Denoising scheme applied prior to
% the compression schemes;
% The differences between the two routines are in:
% a) The Frequency Identification implementation; for example, ndencomp.m
%    makes more use of the second largest coefficient than encp6.m does;
% b) The segmentation between voiced and unvoiced segments: encp6.m uses
%    500 Hz for female speech and 1,000 Hz for male speech; ndencomp.m
%    uses 1,000 Hz for any gender;
% c) The detection of the presence of low energy speech in a high
%    energy noisy background: ndencomp.m implements such a scheme, while encp6.m
%    doesn't;
% d) The Adaptive Thresholding Compression Scheme.
% Written and adapted by J. Roberto V. Martins, October 1995;
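
% Illustrative sketch (not part of the original routines): encp6.m sets P = 8 for a female
% voice and P = 4 for a male voice, compares the peak index ND with length(c)/P, and converts
% indices to frequency with 4000/length(c)*ND. Under that mapping the boundary length(c)/P
% corresponds to 4000/P Hz, which gives the 500 Hz and 1,000 Hz values quoted in item b).
% The function name cutoff_sketch and the parameter fs are assumptions; fs = 8000 matches
% the constant 4000 used in the listings.

```matlab
function [idx, fcut] = cutoff_sketch(len, P, fs)
% CUTOFF_SKETCH  Index and frequency of the len/P boundary for one interval.
%   len - number of coefficients in the interval
%   P   - gender-dependent divisor (8 or 4 in encp6.m)
%   fs  - assumed sampling rate in Hz
idx = len / P;                     % boundary compared against ND
fcut = (fs/2) / len * idx;         % same mapping as 4000/len*ND when fs = 8000
```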


% Encp6.m
clear;

% Input and loading of speech data
V = input('Please enter "1" for female voice and "2" for a male voice :');
if V == 1
  P = 8;
  FV = input(['Please enter 1 for the sentence, 2 for "be", 3 for "hate", 4 for "hey", 5 for "met", 6 for ' ...
    '"pay", 7 for "cats", 8 for "benice":']);
  if FV == 1
    clear ny; load fse;
    ny = [fse' zeros(1,5120)];
  elseif FV == 2
    clear ny; load fbe;
    ny = [fbe' zeros(1,2048)];
  elseif FV == 3
    clear ny; load fha;
    ny = [fha' zeros(1,7168)];
  elseif FV == 4
    clear ny; load fhey;
    ny = [fhey'];
  elseif FV == 5
    clear ny; load fmet;
    ny = [fmet' zeros(1,1024)];
  elseif FV == 6
    clear ny; load fpay;
    ny = [fpay' zeros(1,1024)];
  elseif FV == 7
    clear ny; load fcats;
    ny = [fcats'];
  elseif FV == 8
    clear ny;
    benice = loadwav('benice.wav');
    ny = [((benice(1:16384)/max(benice)) + 0.0119)'];
  end
end

if V == 2
  P = 4;
  W = input(['Please enter 1 for "project", 2 for "cataratas", 3 for "encyclopedia", 4 for "issos", 5 for "assos", 6 ' ...
    'for "six", 7 for "the sentence", 8 for "aka", 9 for "at", 10 for "azure", 11 for "be", 12 for "bird", 13 for "boot", 14 ' ...
    'for "call", 15 for "day", 16 for "eka", 17 for "epa", 18 for "eve", 19 for "father", 20 for "foot", 21 for "for", 22 ' ...
    'for "go", 23 for "hate", 24 for "he", 25 for "ika", 26 for "it", 27 for "key", 28 for "let", 29 for "me", 30 for ' ...
    '"met", 31 for "no", 32 for "obey", 33 for "opa", 34 for "pay", 35 for "read", 36 for "see", 37 for "she", 38 for ' ...
    '"then", 39 for "thin", 40 for "to", 41 for "up", 43 for "vote", 44 for "we", 45 for "you", 46 for "zoo", 47 for ' ...
    '"silence", 48 for "the bye sentence", 49 for "beback", 50 for "blows", 51 for "bruna", 53 for "sounds good" : ']);

  if W == 1
    clear ny; load newvoice;
    ny = y(2700:2700+8191)';
  elseif W == 2
    clear ny; load catar;
    ny = ca(1900:1900+8191)';
  elseif W == 3
    clear ny; load encic;
    ny = en(1200:1200+8191)';
  elseif W == 4
    clear ny; load issos;
    ny = is(1900:1900+8191)';
  elseif W == 5
    clear ny; load assos;
    ny = as(1900:1900+8191)';
  elseif W == 6
    clear ny; load six;
    ny = si(1:8192)';
  elseif W == 7
    clear ny; load myvoice;
    ny = x(9000:9000+32767)';
  elseif W == 8
    clear ny; load aka;
    ny = (ac+0.1656)';
  elseif W == 9
    clear ny; load at;
    ny = (at+0.1655)';
  elseif W == 10
    clear ny; load azure;
    ny = [(az+0.1651)' zeros(1,6144)];
  elseif W == 11
    clear ny; load be;
    ny = [(be+0.1654)' zeros(1,3072)];
  elseif W == 12
    clear ny; load bird;
    ny = [(bi+0.1658)' zeros(1,7168)];
  elseif W == 13
    clear ny; load boot;
    ny = [(bo+0.1652)'];
  elseif W == 14
    clear ny; load call;
    ny = [(cal+0.1654)' zeros(1,6144)];
  elseif W == 15
    clear ny; load day;
    ny = [(da+0.1645)' zeros(1,1024)];


  elseif W == 16
    clear ny; load eka;
    ny = [(ek+0.1653)'];
  elseif W == 17
    clear ny; load epa;
    ny = [(ep+0.1650)' zeros(1,6144)];
  elseif W == 18
    clear ny; load eve;
    ny = [(ev+0.1654)' zeros(1,4096)];
  elseif W == 19
    clear ny; load father;
    ny = [(fa+0.1648)' zeros(1,6144)];
  elseif W == 20
    clear ny; load foot;
    ny = [(foo+0.1653)' zeros(1,6144)];
  elseif W == 21
    clear ny; load for;
    ny = [(fo+0.1649)' zeros(1,6144)];
  elseif W == 22
    clear ny; load go;
    ny = [(go+0.1651)'];
  elseif W == 23
    clear ny; load hate;
    ny = [(ha+0.1657)' zeros(1,7168)];
  elseif W == 24
    clear ny; load he;
    ny = [(he+0.1657)' zeros(1,2048)];
  elseif W == 25
    clear ny; load ika;
    ny = [(ik+0.1654)' zeros(1,6144)];
  elseif W == 26
    clear ny; load it;
    ny = [(it+0.1657)' zeros(1,3072)];
  elseif W == 27
    clear ny; load key;
    ny = [(ke+0.1652)' zeros(1,2048)];
  elseif W == 28
    clear ny; load let;
    ny = [(le+0.1657)'];
  elseif W == 30
    clear ny; load met;
    ny = [(met+0.1653)'];
  elseif W == 31
    clear ny; load no;
    ny = [(no+0.1646)' zeros(1,1024)];
  elseif W == 34
    clear ny; load pay;
    ny = [(pa+0.1655)' zeros(1,1024)];
  elseif W == 36
    load see;
    ny = [(se+0.1653)' zeros(1,1024)];
  elseif W == 37
    load she;
    ny = [(sh+0.1654)'];
  elseif W == 38
    load then;
    ny = [(th+0.1656)' zeros(1,6144)];
  elseif W == 39
    load thin;
    ny = [(thi+0.1655)' zeros(1,1024)];
  elseif W == 40
    load to;
    ny = [(to+0.1649)' zeros(1,3072)];
  elseif W == 41
    load up;
    ny = [(up+0.1653)' zeros(1,2048)];
  elseif W == 43
    load vote;
    ny = [(vo+0.1654)' zeros(1,1024)];
  elseif W == 44
    clear ny; load we;
    ny = [(we+0.1655)' zeros(1,2048)];
  elseif W == 45
    clear ny; load you;
    ny = [(you+0.1655)' zeros(1,2048)];
  elseif W == 46
    clear ny; load zoo;
    ny = [(zo+0.1646)' zeros(1,2048)];
  elseif W == 47
    clear ny; load myvoice;
    ny = x(1:8192)';
  elseif W == 48
    clear ny; load bye;
    ny = [bye' zeros(1,9216)];
  elseif W == 49
    clear ny; load beback;
    ny = [(beback-127.4452)' zeros(1,4824)];
  elseif W == 50
    clear ny;
    blows = loadwav('blows.wav');
    ny = [(blows(1:16384)+0.3027)'];
  elseif W == 51
    clear ny;
    br = loadwav('bruna.wav');
    ny = [(br/max(abs(br))+0.0155)' zeros(1,7268)];
  elseif W == 52
    clear ny;
    adam = loadwav('adamsfam.wav'); % assigned to "adam", the name used on the next line
    ny = [(adam(1:32768)/max(abs(adam))+0.0081)'];
  elseif W == 53
    clear ny; load engl6;
    ny = [(engl6(1:16384)+3.019e-4)'];
  elseif W == 54
    clear ny; load voiq;
    ny = [voiq(1:32768)'];
  end
end


% Implementing the Cosine Packet Transform
n = length(ny)

D = input('Enter the finest depth for Time Splitting :');
cp = CPAnalysis(ny,D,'Sine');
stree = CalcStatTree(cp,'Entropy');
[btree,vtree] = BestBasis(stree,D);
[n,L] = size(cp);

% Create Bell
bellname = 'Sine';
m = n/2^D/2;
[bp,bm] = MakeONBell(bellname,m);
x = zeros(1,n);

% initialize tree traversal stack
stack = zeros(2,2^D+1);
tp = zeros(1,n);
v = zeros(1,n);
compr = zeros(1,n);
coef = zeros(1,n);
ncoef = zeros(1,n);
k = 1;
stack(:,k) = [0 0]';
v = zeros(1,n);
ind = 0;
le = zeros(1,2^D);
while (k > 0),
  d = stack(1,k); b = stack(2,k); k = k-1;
  if (btree(node(d,b)) ~= 0), % nonterminal node
    k = k+1; stack(:,k) = [(d+1) (2*b)]';
    k = k+1; stack(:,k) = [(d+1) (2*b+1)]';
  else
    c = cp(packet(d,b,n),d+1)';
    coef(1,b/(2^d).*n+1:(b+1)/(2^d).*n) = c;
    i = (b/(2^d)*n+1);
    len = length(c);
    [I,ND] = max(abs(c));
    compr(1,b/(2^d)*n+1) = length(c);

    % Identifying the Frequency Content of each interval
    if ND <= round(len/16)
      [sI,sND] = max(abs([coef(i:i+ND-2),0,coef(i+ND:i+len-1)]));
      if (4000/len*ND) > 125 %ND <= round(len/32)
        if (4000/len*sND) < 400
          if (4000/len*sND) >= 125
            v(i) = 1;
            ncoef(i:compr(i)-1+i) = [zeros(1,round(len/64)), ...
              coef(i+round(len/64):i+round(len/5)-1),zeros(1,len-round(len/5))]; % Denoising
          else
            if (4000/len*sND) >= 60
              ncoef(i:compr(i)-1+i) = [zeros(1,round(len/64)), ...
                coef(i+round(len/64):i+round(len/16)-1),zeros(1,len-round(len/16))]; % Denoising
              v(i) = 1;
            else
              if (4000/len*sND) <= 30
                ncoef(i:len-1+i) = zeros(1,len); % Denoising
                v(i) = 0;
              else
                ncoef(i:len-1+i) = [zeros(1,round(len/64)), ...
                  coef(i+round(len/64):i+round(len/16)-1),zeros(1,len-round(len/16))]; % zeros(1,len);
                v(i) = 1;
              end
            end
          end
        elseif sND < len/4
          ncoef(i:compr(i)-1+i) = [zeros(1,len/64),coef(i+round(len/64):i+len-1)]; % Denoising
          v(i) = 1;
        else
          ncoef(i:compr(i)-1+i) = [zeros(1,len/16),coef(i+len/16:i+len-1)]; % Denoising
          v(i) = 2;
        end
      end

      if (4000/len*ND) <= 125
        if sND <= len/16
          ncoef(i:len-1+i) = zeros(1,len); % Denoising
          v(i) = 0;
        elseif sND <= len/4
          if (4000/len*sND) >= 300
            ncoef(i:i+len-1) = [zeros(1,sND-1),coef(i+sND-1),zeros(1,len-sND)]; % [zeros(1,len/16),coef(i+len/16:i+len-1)];
            v(i) = 1;
          else
            ncoef(i:i+len-1) = zeros(1,len); % Denoising
            v(i) = 0;
          end
        else
          if (4000/len*ND) < 64
            ncoef(i:compr(i)-1+i) = zeros(1,len); % Denoising
            v(i) = 0;
          else
            ncoef(i:compr(i)-1+i) = [zeros(1,len/4),coef(i+len/4:i+len-1)];
            v(i) = 2;
          end
        end
      end

    elseif ND < length(c)/P
      [sI,sND] = max(abs([coef(i:i+ND-2),0,coef(i+ND:i+len-1)]));
      if sND < len/32
        SND = sND
        ncoef(i:compr(i)-1+i) = zeros(1,len); % Denoising
        v(i) = 0;
      else
        v(i) = 1;
        ncoef(i:compr(i)-1+i) = [zeros(1,len/16),coef(i+len/16:i+len-1)]; % Denoising
      end
    else
      [sI,sND] = max(abs([coef(i:i+ND-3),0,0,0,coef(i+ND+1:i+len-1)]));
      if sND >= length(c)/(2*P)
        v(i) = 2;
        ncoef(i:compr(i)-1+i) = [zeros(1,len/16),coef(i+len/16:i+len-1)]; % Denoising
      else
        v(i) = 1;
        ncoef(i:compr(i)-1+i) = [zeros(1,len/16),coef(i+len/16:i+len-1)]; % Denoising
      end
    end

    ec = sum(c.^2);
    van = std(c);
    tp(1,b/(2.^d).*n+1) = 1;
    len = length(c);
    ind = ind + 1;
    le(ind) = log2(len);
    v(i);
    rko = length(c)/16;
    ko = ND;
    fo = 4000/length(c)*ND;
    de(ind) = d;
    be(ind) = b;
    toten = sum(coef.^2);
    i = (b/(2^d)*n+1);
    if v(i) == 0
      nncoef(i:compr(i)-1+i) = ncoef(i:compr(i)-1+i);
      nc = nncoef(i:compr(i)-1+i);
    end

    % Applying the Adaptive Thresholding Compression Scheme
    if v(i) == 1
      if sum(ncoef(i:compr(i)-1+i).^2) < toten/n * len
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
        end
      else
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),98.7);
        else
          nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),97.66);
        end
      end
      nc = nncoef(i:compr(i)-1+i);
    end

    if v(i) == 2
      sumco = sum(coef(i:compr(i)-1+i).^2);
      thres = 0.5*toten/n * len;
      if sum(coef(i:compr(i)-1+i).^2) < toten/n * len; % it was 0.5*toten/n*len
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:len+i-1) = comp(ncoef(i:len+i-1),99.5);
        end
      else
        if len < 2*n/(2^D)
          nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
        else
          nncoef(i:compr(i)-1+i) = comp(ncoef(i:compr(i)-1+i),99.5);
        end
      end
      nc = nncoef(i:compr(i)-1+i);
    end


    y = dct_iv(nc); % Inverse Transforming each interval

    % Unfolding and reconstructing the time sequence after compression
    [xc,xl,xr] = unfold(y,bp,bm);
    x(packet(d,b,n)) = x(packet(d,b,n)) + xc;
    if b > 0,
      x(packet(d,b-1,n)) = x(packet(d,b-1,n)) + xl;
    else
      x(packet(d,0,n)) = x(packet(d,0,n)) + edgeunfold('left',xc,bp,bm);
    end
    if b < 2^d-1,
      x(packet(d,b+1,n)) = x(packet(d,b+1,n)) + xr;
    else
      x(packet(d,b,n)) = x(packet(d,b,n)) + edgeunfold('right',xc,bp,bm);
    end
  end
end

nind = sum(le>0);
nle = le(1:nind);

XX = x.*6;

figure(1),plot(ny), hold;
plot(tp,':'),hold off;
figure(2),plot(ny),hold
plot(v,':'),hold off;

mse = mean((ny - x).^2) % Computing the mean square error between the
% original and the reconstructed compressed one
scoefinO = sum(abs(coef)>0) % Computing the number of non-zero
% coefficients before denoising/compression
sncoefinO = sum(abs(nncoef)>0) % Computing the number of non-zero
% coefficients after denoising/compression

figure(3),
subplot(2,2,1),plot(ny,'b');
title('MET, male speaker');
subplot(2,2,2),plot(x,'b');
title('AFTER DENOISING/COMPRESSION')
subplot(2,2,3),specgram(ny);
title('Original Spectrogram');
subplot(2,2,4),specgram(x);
title('AFTER DENOISING/COMPRESSION')


% Ndencomp.m

% Obs.: This routine has parts a), b) and c) identical to the same parts of 
% routine encp6.m. Thus we are only presenting the complement, which begins 
% in part d). 

compless = 0; 
endearly = 0; 

% Identifying the Frequency Content of each interval 
if ND <= round(len/16); %it was len/64*3
if ND <= round(len/64)

[ sI,sND ] = max(abs([coef(i:i+ND-2),0,coef(i+ND:i+len-1)]));
if (4000/len * sND) <= 300 % try to make it better!!!
if len > n/(2^D)*8

coef(i+ND-1)=0;

coef(i+sND-1)=0;

[ tI,tND ] = max(abs(coef(i:i+len-1)));% (recovering the "ts" sound)

% implemented to solve problems like in "cats" : it
% still needs to be improved!!

if tND < round(len/20)
v(i) = 0.1;
else

TND = tND;
compless = 1;
endearly = 1;

ncoef(i:i+len-1) = zeros(1,len); % Denoising
%[zeros(1,len/64),coef(i+len/64:i+len-1)];
%[zeros(1,tND-1),coef(i+tND-1),zeros(1,len-tND)];
%[zeros(1,tND-1),coef(i+tND-1:i+len-1)];

v(i) = 0.5;
end

else

v(i) = 0.1;
end

elseif sND <= round(len/8)

ncoef(i:i+len-1) = [ zeros(1,sND-1),coef(i+sND-1),zeros(1,len-sND) ];% Denoising
v(i) = 0.5;

elseif sND <= round(len/4)

ncoef(i:i+len-1) = [ zeros(1,sND-1),coef(i+sND-1),zeros(1,len-sND) ];% Denoising
v(i) = 1.0;
else

v(i) = 0.1;
end

else
[ sI,sND ] = max(abs([coef(i:i+ND-2),0,coef(i+ND:i+len-1)]));

if sND < len/20 % see in encp6 how it was made for 125<ND<=250 & 125<=sND<400
v(i) = 0.1;

elseif sND <= round(len/8)
compless = 1; % flag to indicate to compress less
ncoef(i:i+len-1) = [zeros(1,len/32), coef(i+len/32:i+len-1)];% Denoising
v(i) = 0.5;

elseif sND < length(c)/(2*P)
compless = 1;

ncoef(i:i+len-1) = [zeros(1,len/32),coef(i+len/32:i+len-1)]; % Denoising
v(i) = 1;

elseif sND <= round(len/P)
compless = 1;

ncoef(i:i+len-1) = [zeros(1,len/32),coef(i+len/32:i+len-1)]; % Denoising
v(i) = 1.5;

else

compless = 1;

ncoef(i:i+len-1) = [zeros(1,len/32),coef(i+len/32:i+len-1)]; % Denoising
v(i) = 2;


end 

end 

elseif ND <= round(len/8)

ncoef(i:i+len-1) = [zeros(1,len/32), coef(i+len/32:i+len-1)];% Denoising
v(i) = 0.25; % it was 0.2
elseif ND <= round(len/4)

ncoef(i:i+len-1) = [ zeros(1,len/32), coef(i+len/32:i+len-1)];% Denoising
v(i) = 0.5; % it was 0.4

elseif ND < length(c)/(2)

ncoef(i:i+len-1) = [ zeros(1,len/32), coef(i+len/32:i+len-1)]; % Denoising
v(i) = 1; % it was 0.6
else

[sI,sND] = max(abs([coef(i:i+ND-3),0,0,0,coef(i+ND+1:i+len-1)]));
if sND >= length(c)/(2*P)
if ND <= len/2

ncoef(i:i+len-1) = [ zeros(1,len/32), coef(i+len/32:i+len-1)];% Denoising
v(i) = 1.5; %it was 0.75;
elseif ND <= round(len*3/4)

%compless = 1; % included to help in the voice quality sentence
ncoef(i:i+len-1) = [zeros(1,len/32),coef(i+len/32:i+len-1)];% Denoising
v(i) = 2; % it was 0.9
else

ncoef(i:i+len-1) = [zeros(1,len/32),coef(i+len/32:i+len-1)];% Denoising
v(i) = 2.5; % it was 1.0
end

elseif sND > round(len/8)

ncoef(i:i+len-1) = [zeros(1,len/32),coef(i+len/32:i+len-1)];% Denoising
v(i) = 1; % it was 0.6

elseif sND >= round(len/16)

ncoef(i:i+len-1) = [zeros(1,len/32),coef(i+len/32:i+len-1)];% Denoising
v(i) = 0.5; % it was 0.4
else

ncoef(i:i+len-1) = [ zeros(1,len/32),coef(i+len/32:i+len-1)];% Denoising
v(i) = 0.25; % it was 0.2
end 

end 

EC = sum(c.^2);

% Computing the coefficients energy

ec(i:i+len-1) = ones(1,len) .* sum(c.^2);
es(i:i+len-1) = ones(1,len) .* sum(ny(i:i+len-1).^2);

% Computing the energy of each interval

vari = std(c);

%ncoef(1,b/(2^d).*n+1:(b+1)/(2^d).*n) = nc;

tp(1,b/(2.^d).*n+1) = 1;

len = length(c);

ind = ind + 1;

le(ind) = log2(len);

rko = length(c)/16;

ko = ND;

fo = 4000/length(c)*ND

de(ind) = d;

be(ind) = b;

toten = sum(coef.^2);

i = (b/(2^d)*n+1);

% Applying the Adaptive Thresholding Compression Scheme 
if v(i)==0.1

nncoef(i:i+len-1) = zeros(1,len);
nc = nncoef(i:compr(i)-1+i);
elseif v(i) <= 0.5

if sum(coef(i:compr(i)-1+i).^2) < toten/n * len
if len < 2*n/(2^D)

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
else

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
end
else

if len < 2*n/(2^D)

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),98.7);
else

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),97.66);
end
end


nc = nncoef(i:compr(i)-1+i);
end 

if v(i) > 0.5 % it was 0.4; % it was= 1
sumco = sum(coef(i:compr(i)-1+i).^2);
thres = 0.5*toten/n * len;

if sum(coef(i:compr(i)-1+i).^2) < toten/n * len; % it was 0.5*toten/n*len
if len < 2*n/(2^D)
if compless == 1

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),97.66);
else

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
end
else

if compless == 1

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),98.7);
else

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
end

end

else

if len < 2*n/(2^D)
if compless == 1

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),98.7);
else

nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),99.5);
end
else

%if compless == 1

%nncoef(i:compr(i)-1+i) = comp((ncoef(i:compr(i)-1+i)),98.7);
%else

nncoef(i:compr(i)-1+i) = comp(ncoef(i:compr(i)-1+i),99.5);

%end

end

end


nc = nncoef(i:compr(i)-1+i);


end 

if endearly == 1

[ma,md] = max(nncoef(i:i+len-1));
end

y = dct_iv(nc);

% Inverse Transforming each interval
if endearly == 1

y = y.*(abs(nncoef(i:i+len-1))>0);
end

% Unfolding and Reconstructing the Time sequence after compression
[xc,xl,xr] = unfold(y,bp,bm);
x(packet(d,b,n)) = x(packet(d,b,n)) + xc;
if b > 0,

x(packet(d,b-1,n)) = x(packet(d,b-1,n)) + xl;
else

x(packet(d,0,n)) = x(packet(d,0,n)) + edgeunfold('left',xc,bp,bm);
end

if b < 2^d-1,
x(packet(d,b+1,n)) = x(packet(d,b+1,n)) + xr;
else

x(packet(d,b,n)) = x(packet(d,b,n)) + edgeunfold('right',xc,bp,bm);
end 
end 

end 

nind = sum(le>0);
nle = le(1:nind);
figure(1),plot(ny), hold;
plot(tp,':'),hold off;
figure(2),plot(ny),hold
plot(v,':'),hold off;
mse = mean((ny - x).^2)

% Computing the mean square error between original signal and the signal after compression

scoefm0 = sum(abs(coef)>0)

% Computing the number of non-zero coefficients before
% denoising/compression
sncoefm0 = sum(abs(nncoef)>0)

% Computing the number of non-zero coefficients after 
% denoising/compression 

figure(3), 

plot(x); 

figure(4), 

plot(ec); 

figure(5), 

plot(es); 

first =1; 

nv = zeros(1,length(v));
nnv = zeros(1,length(v));
for i=1:length(v)
if v(i) > 0
if v(i) <= 0.5
nv(i) = 1.0;
dist = i - first;

%if dist >= 512
if ec(first) > toten/(n*32)*dist
nnv(first) = nv(first);
nnv(i) = nv(i);
end

first = i;
end

if v(i) > 0.5
nnv(i) = 1.5;
end
end
end

XX = x.*4;

figure(6),plot(ny),hold
plot(nnv,':'),hold off;
figure(7)

subplot(2,2,1),plot(ny,'b')
title('"PAY", male speaker')

subplot(2,2,2),plot(x,'b')

title('AFTER DENOISING/COMPRESSION')

subplot(2,2,3),specgram(ny,[],1)

title('ORIGINAL SPECTROGRAM')

subplot(2,2,4),specgram(x,[],1)

title('AFTER DENOISING/COMPRESSION')


% Name: encptour.m and ndentour.m 

% Subject: Analysis, Denoising/Compression, Encoding, Decoding and Synthesis 
% of speech data 

% Description: These two routines were applied on top of encp6.m and
% ndencomp.m. They perform the following tasks in addition to those
% already performed by encp6.m and ndencomp.m:

% a) Implementation of the Linear Quantizer for the Coefficients
% vector;

% b) Encoding of the Locations vector;

% c) Encoding of the positions of the beginning of each segment;

% d) Huffman coding of the Coefficients vector and of the Locations vector;

% e) Decoding of all the vectors on the receiver's side;

% f) Reconstruction of the Denoised/compressed sequence at the
% receiver's side;

% Note: This code is placed on top of the existing codes encp6.m and ndencomp.m
% Written by J. Roberto V. Martins, October 1995.

[X,L,seglens,de,be] = enc(nncoef,nle,de,be); % Encoding the locations and coefficients

[TX,prob,nprob,probdesc,N,nq,S] = quantx(X,QL); % Quantizing the coefficients

np = length(probdesc);

avwcoeff = huffcod(np,probdesc); % Huffman coding the coefficients vector
totcoeff = avwcoeff*length(TX);
debe = [ de be ];
sdebe = size(debe);

ordebe = sort(debe);

ndebe = zeros(1,length(ordebe));

countdb = 1;

ndebe(1,1) = ordebe(1);

lndb = length(ndebe);
for countdb=1:length(debe)-1
if ordebe(countdb+1) > ordebe(countdb)

countdb=countdb+1;
ndebe(countdb) = ordebe(countdb);

end 

end 

index = 0; 
czero = 0; 

for countdb = 1:length(ndebe)

if ndebe(countdb) > 0
index = index + 1;
nndebe(index) = ndebe(countdb);
else

czero = czero + 1;
end
end

probdebe = nndebe/sum(nndebe);
probdbde = fliplr(sort(probdebe));
nprobdbd = probdbde(1:length(probdbde));

avwdebe = huffcod(length(nndebe),nprobdbd); % coding the des and the bes (see chapter VII) 
totdebe = avwdebe*(length(debe) - czero) + czero; 

[DL,probl,lenprob] = difl(L); 

%mDL = max(DL(2:length(DL))); 

totndl = 0; 

pow = 1; 

for indl = 1:length(DL)
while DL(indl) > 2^pow,
pow = pow+1;
end

totndl = totndl + pow; % calculating the necessary number of bits to transmit NDL; we're not using this
pow = 1;
end
totndl = totndl + round(log2(DL(1))); % we're using the vector DL to transmit the locations


sn = 0; 

qc = zeros(1,n);
nde = fliplr(de);
nbe = fliplr(be);
nseglens = fliplr(seglens);
nv = 1;

I = nbe./(2.^de).*n + 1;

for ns = 1:length(L)
for ni = 1:length(I)-1

if L(ns) >= I(ni) 

if L(ns) <= I(ni+1) 

sn = sn+1; 

SN(sn) = ni; 

NDL(sn) = L(ns)-I(ni); 

end 

end 

end 

end 

pro = SN/sum(SN); 
nsimbsn = max(SN) - min(SN) +1; 
realpr = zeros(1,nsimbsn);
indsn = 1;

realpr(indsn) = pro(indsn);
for isn = 1:length(pro)-1
if pro(isn+1) == pro(isn)


realpr(indsn) = realpr(indsn) + pro(isn+1);
else

realpr(indsn+1) = pro(isn+1);

indsn = indsn + 1;

end

end

desreapr = fliplr(sort(realpr));

RL(1) = DL(1); % Reconstructing L, the locations vector

for cl = 1:length(DL)-1
RL(cl+1) = RL(cl) + DL(cl+1);
end

for nv=1:length(nde)-1 %i:i+(b/(2^d)*n-1) 1:length(nseglens)

d = nde(nv);
b = nbe(nv);

i = (b/(2^d)*n+1);

nnc = qc(i:(nbe(1,nv+1)/(2^de(1,nv+1))*n)); %(i:2^seglens(v)+i-1);
thislen = nbe(nv+1)/2^de(nv+1)*n-i+1;

for z = i:i + (thislen-1) %(nbe(1,nv+1)/(2^de(1,nv+1))*n) %1:2^seglens(nv)

for t = 1:length(RL)
if z == RL(t)

qc(z) = TX(t)/(nq/2)*S;

end

end

end
unfprev = 0;


unfnex = 0;

nnc = qc(i:(nbe(1,nv+1)/(2^nde(1,nv+1))*n)); %(i:2^seglens(v)+i-1);

% Inverse transforming to the Time Domain, Unfolding and Reconstructing the
% Denoised/Compressed Decoded Speech Sequence

y = dct_iv(nnc);

[xc,xl,xr] = unfold(y,bp,bm);

x1(packet(d,b,n)) = x1(packet(d,b,n)) + xc;

if nv > 1 %nv = 1 %if b>0,

x1(packet(d,b-1,n)) = x1(packet(d,b-1,n)) + xl;
else

x1(packet(d,b,n)) = x1(packet(d,b,n)) + edgeunfold('left',xc,bp,bm);
end

if b < 2^d-1,

x1(packet(d,b+1,n)) = x1(packet(d,b+1,n)) + xr;
else

x1(packet(d,b,n)) = x1(packet(d,b,n)) + edgeunfold('right',xc,bp,bm);
end 


end 

figure(5),plot(ny),hold 
plot(v,':'),hold off; 

mse = mean((x - x1).^2) % Computing the mean square error between the denoised/compressed sequence in
% the transmitter and the decoded sequence in the receiver;
scoefm0 = sum(abs(coef)>0) % Computing the number of original non-zero coefficients
sncoefm0 = sum(abs(nncoef)>0) % Computing the number of non-zero coefficients
% after denoising/compression

sqcoefm0 = sum(abs(qc)>0) % Computing the number of non-zero coefficients after decoding

figure(6),

plot(x);

figure(7)

subplot(2,3,1),plot(ny)

title('Original "PAY", Female speaker')
subplot(2,3,2),plot(x)
title('After Denoising/Compression')
subplot(2,3,3),plot(x1)

title('After Decoding')
subplot(2,3,4),specgram(ny,[],1)
title('Original Spectrogram')
subplot(2,3,5),specgram(x,[],1)
title('After Denoising/Compression')
subplot(2,3,6),specgram(x1,[],1)

title('After Decoding')


TOTNBITS = totcoeff + totdebe + totndl


TOTNSAMP = length(TX) + length(debe) + length(SN) + length(NDL)

BITPSAMP = TOTNBITS/TOTNSAMP

COMPRATIO = 100 - (TOTNBITS/(scoefm0*8)*100)
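The compression figure computed at the end of the script compares the total number of transmitted bits with the cost of sending every original non-zero coefficient at 8 bits each. The following worked example is only a sketch with made-up bit counts (all four input values are hypothetical, and the variable names merely mirror those of the script):

```matlab
% Worked example of the bit-budget arithmetic (all input numbers are hypothetical).
totcoeff = 1200;   % bits spent on the Huffman-coded coefficients
totdebe  = 60;     % bits spent on the depth/block indices
totndl   = 900;    % bits spent on the coefficient locations
scoefm0  = 4096;   % number of original non-zero coefficients

TOTNBITS  = totcoeff + totdebe + totndl;          % total bits transmitted
COMPRATIO = 100 - TOTNBITS/(scoefm0*8)*100;       % percentage of bits saved
ratio     = (scoefm0*8)/TOTNBITS;                 % same result expressed as N:1

disp([TOTNBITS COMPRATIO ratio])
```

With these inputs, 2160 bits replace 32768 bits, so the saving is about 93.4 percent, or roughly 15:1.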


% Name: Comp.m
% This function receives as input:

% A vector "c" composed of coefficients and
% a percentage number "pcent";

% As an output, this function gives a vector of the same length whose non-zero
% components are the top (100 - pcent) percent dominant coefficients extracted from
% the original vector;

% Written by J. Roberto V. Martins in October of 1995.
function cc = comp(c,pcent)
d = sort(abs(c));

p = round(pcent/100*length(c));
for i = 1:length(c)
if p == 0
cc(i) = c(i);

elseif abs(c(i)) <= d(p)

cc(i) = 0;

else cc(i) = c(i);

end 

end 

%d = (abs(c)>pcent/100*max(abs(c))); 

%cc = c.*d; 
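The loop in comp.m can also be expressed without explicit iteration. The following is a minimal sketch, not the thesis code: it assumes the same rule as the listing (zero every coefficient whose magnitude does not exceed the p-th smallest magnitude) and uses a made-up test vector.

```matlab
% Vectorized sketch of the thresholding rule used by comp.m (illustrative only).
c = [0.9 -0.1 0.05 -2.0 0.3 0.0 1.2 -0.4];   % hypothetical coefficient vector
pcent = 75;                                   % percentage of coefficients to discard

d = sort(abs(c));                             % magnitudes in ascending order
p = round(pcent/100*length(c));               % number of coefficients to zero out
if p == 0
    cc = c;                                   % nothing to discard
else
    cc = c .* (abs(c) > d(p));                % keep magnitudes above the p-th smallest
end
disp(cc)
```

Because the comparison is strict, coefficients tied with the threshold magnitude are discarded as well, exactly as in the loop version.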


% Name: Enc.m 

% This function receives a vector, its length and the vectors de 
% and be. As an output, it returns: 

% X, a vector with non-zero coefficients extracted from the input vector; 
% L, the vector containing the locations of the non-zero coefficients from 
% the original input vector; 

% Written by J. Roberto V. Martins in October of 1995
function [X,L,seglens,de,be] = enc(vector,lenvec,de,be)


n = 0;
m = 0;

for i = 1:length(vector)
if abs(vector(i)) > 0 
n = n + 1; 

X(n) = vector(i); 
end 
end 

for j = 1: length(vector) 
if abs(vector(j)) > 0 
m = m + 1; 
L(m)=j; 
end 
end 

seglens = lenvec; 
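The two loops of enc.m separate a sparse coefficient vector into its non-zero values and their positions. As a hedged illustration (the input vector is made up, and the built-in find is swapped in for the explicit loops), the same split and the receiver-side rebuild could look like this:

```matlab
% Sketch: extracting non-zero coefficients and their locations with find.
vector = [0 0 1.5 0 -0.7 0 0 2.1];   % hypothetical thresholded coefficient vector

L = find(vector ~= 0);               % locations of the non-zero coefficients
X = vector(L);                       % the non-zero coefficients themselves

% The receiver can rebuild the sparse vector from X, L and the original length.
rebuilt = zeros(1, length(vector));
rebuilt(L) = X;
disp(rebuilt)
```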


% Name: Difl.m 

% This function encodes a vector by transforming it into a
% differentially encoded vector. It receives the vector to be encoded as an input and returns
% _ The differences vector
% _ The probabilities vector in descending order as well as its length


function [DL,prob,lendl] = difl(vec)

DL(1) = vec(1);

for z = 2:length(vec)

DL(z) = vec(z) - vec(z-1);

end

a=0;

N = zeros(1,length(DL));

count = 1;

SDL = sort(DL);

NSDL(1) = SDL(1);
for p = 1:length(SDL)-1
if SDL(p+1) > SDL(p)
count = count + 1;

NSDL(1,count) = SDL(1,p+1);
end
end

N = zeros(1,length(NSDL));
for l = 1:length(DL)
for x = 1:length(NSDL)
if DL(l) == NSDL(x)

N(x) = N(x) + 1;
end
end
end

prob = fliplr(sort(N)/sum(N)); 
lendl = length(prob); 
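Differential encoding is lossless: the decoder recovers the locations by a running sum, which is what the RL loop in encptour.m does. The sketch below assumes a made-up, strictly increasing locations vector and uses the built-ins diff, cumsum and unique in place of the explicit loops of difl.m.

```matlab
% Sketch: differential encoding of a locations vector and its inverse.
L = [3 5 8 20 21 40];            % hypothetical locations vector (strictly increasing)

DL = [L(1), diff(L)];            % first location followed by successive differences
RL = cumsum(DL);                 % decoder: running sum restores the locations

% Relative frequency of each distinct difference, in descending order.
vals = unique(DL);
counts = zeros(1, length(vals));
for k = 1:length(vals)
    counts(k) = sum(DL == vals(k));
end
prob = fliplr(sort(counts/sum(counts)));
disp(DL)
```

The differences are usually much smaller than the absolute locations, which is why they need fewer bits to transmit.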


% Name: Quantx.m 

% This function performs the Linear Quantization proposed in this thesis for a 
% given input vector. Inputs are the vector X to be quantized and 
% the number of quantization levels desired, nq. 

% Outputs are: 

% _ TX: the quantized vector to be transmitted;

% _ prob: The vector of probabilities of all values in the input vector;

% _ nprob: The new vector of probabilities of all non-zero values in the input vector;

% _ probdesc: The new probabilities vector in descending order for input to the Huffman code;
% _ N: The length of probdesc;

% _ nq: The number of quantization levels (equal to the input nq);

% _ S: The scaling factor S, i.e. the highest absolute value present in the vector;

% Written by J.Roberto V. Martins, October 1995. 


function [TX,prob,nprob,probdesc,N,nq,S] = quantx(X,nq)
prob = zeros(1,nq+1);

S = max(abs(X));
normX = X/S;

TX = round(normX*nq/2);

%[N,Q] = hist(TX,length(TX));
N = zeros(1,nq+1);

STX = sort(TX);

for s = 1:length(TX)
for p = -nq/2:1:nq/2
if TX(s) == p

N(1,p+nq/2+1) = N(1,p+nq/2+1) + 1;
end
end
end

prob = N/sum(N);
t = 0;

for s = 1:length(prob)
if prob(s) > 0
t = t+1;

nprob(1,t) = prob(1,s);

end 

end 

probdesc = fliplr(sort(nprob)); 
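The quantizer scales the coefficients by the largest magnitude S and rounds them to integer levels between -nq/2 and nq/2; the receiver inverts the scaling as in the decoding loop of encptour.m (TX/(nq/2)*S). The round trip can be sketched as follows, with a made-up input vector and a made-up number of levels:

```matlab
% Sketch: linear quantization and receiver-side reconstruction.
X  = [0.8 -0.2 0.05 -1.6 0.4];   % hypothetical non-zero coefficients
nq = 16;                         % hypothetical number of quantization levels

S    = max(abs(X));              % scaling factor (largest magnitude)
TX   = round(X/S*nq/2);          % integer levels in [-nq/2, nq/2]
Xhat = TX/(nq/2)*S;              % reconstruction at the receiver

maxerr = max(abs(X - Xhat));     % never exceeds half a step, i.e. S/nq
disp(TX)
```

Coefficients smaller than half a quantization step, such as 0.05 here, are rounded to level zero.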


% Name: Huffcod.m
% This function receives as input:

% q , the number of symbols; and

% p , the vector containing the probabilities of each symbol;

% As an output, it gives the average word length of the sequence;

% The function uses the code Huffman.m, by K.L. Frack, written on 30 November 1993
% Modifications made by J.Roberto V. Martins in October 1995.

% HUFFMAN finds the minimum variance Huffman code for the symbol
% probabilities entered by the user. The algorithm makes use of
% permutation matrices for the combination and sorting of probabilities.

% Permutation matrices are used because they provide a convenient record
% of operations, so that the codewords can then be constructed fairly easily
% once the combination and sorting of probabilities yields just two
% probabilities. At this point a zero is assigned to one of the
% probabilities and a one assigned to the other. The permutation matrices
% are used to append additional zeros and ones as appropriate to obtain
% the final codeword for each symbol.
% Written by K.L. Frack for EC4580 Course Project
% Last Update: 30 November 1993
function [ avwlen ] = huffcod(q,p) 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% INPUT THE SYMBOLS TO BE CODED % 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% INPUT THE NUMBER OF SYMBOLS TO BE CODED. NO TRIVIAL SOLUTION ALLOWED. 
%q=0; % q = number of symbols. Set to 0 to ensure that the loop

% will be executed at least once

%while q<3 % Need at least 3 symbols for a non-trivial solution

%q=input('Enter the number of symbols: ');

if q<3,disp('Trivial solution. Use a larger number of symbols.'), end
%end

% ENTER THE SYMBOL PROBABILITIES.

% Note: The probabilities must sum to 1.00 and must be entered in
% descending order for the algorithm to work properly. Since the algorithm
% will give erroneous results if these errors are overlooked, error checking
% routines are included in later steps.

%disp(' ')

%disp('Enter the symbol probabilities (in descending order).')

%for i=1:q, p(i)=input([' Enter the probability of s',int2str(i),': ']); end
% ENSURE THERE ARE ENOUGH PROBABILITIES ENTERED
% If <RETURN> is inadvertently struck before a probability is entered the
% input command could yield a probability vector which is too small. This
% causes the program to crash. This procedure prevents this from happening
% by setting all of the missing probabilities to zero. In this event the
% user can correct the wrong probabilities in a later step.
if length(p)<q, p=[p;zeros(q-length(p),1)]; end

% ERROR CHECK THE SYMBOL PROBABILITIES 
correct='n'; % correct = 'n' ensures at least once through the error checking
% loop.

count=0; % count = 0 makes the loop a little simpler. It prevents the
% program from prompting for a correction until the loop has
% been executed at least once.
while correct ~= 'y' % Keep looping until correct.
if count>0; % This procedure will be executed only if there are errors

% to be corrected.

s=input('Enter the index of the incorrect probability: ');
p(s)=input(['Enter the correct probability for s',int2str(s),': ']);
end

count=1;

% Display the table.

disp(' ')

disp('Index Symbol Probability')

disp('------------------------')

for i=1:q

is=[int2str(i) blanks(6)]; is=is(1:7); % makes a string from the index.
ps=[num2str(p(i)) '000000']; ps=ps(1:6); % makes a string from the prob.
disp([' s',is,' ',ps]) % displays the table

end

if abs(sum(p)-1)>1e-8 % Ensures probabilities sum to one.
correct = 'n'; % there was a "beep," before
disp(' ')

disp('Error --> Probabilities do not sum to 1.00!')
elseif max(diff(p))>0 % Ensures probabilities are in descending order.
correct = 'n'; % there was a "beep" before
disp(' ')

disp('Error --> Probabilities are not in descending order!')
else correct=input('Is the table correct? (Enter y or n): ','s');

% Asks the user to verify that all the probabilities are correctly
% entered. A 'n' response will prompt the user for corrections.
end

end, clear correct is ps count
p=p'; % p must be a column vector 
pp=p; % pp = extra copy of the original probability vector 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% FORM THE Q-2 PERMUTATION MATRICES (LEFT MULTIPLICATION) %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% INITIALIZE EACH MATRIX TO THE ZERO MATRIX OF APPROPRIATE DIMENSION
for i=1:q-2, eval(['P' int2str(i) '=zeros(q-i,q-i+1);']), end

% SUM THE LOWEST TWO PROBABILITIES AND DETERMINE NEW SORTED LOCATIONS
for k=1:q-2 % do for each of the q-2 permutation matrices

Sum=p(q+1-k)+p(q-k); % sum the two lowest (and smallest) probabilities
i=1;

while Sum < p(i) % find the highest location in the p vector for the sum
eval(['P' int2str(k) '(i,i) = 1;'])
i=i+1;
end

eval(['P' int2str(k) '(i,q-k:q-k+1) = [1 1];']) % This is the spot
while i<q-k % form rest of matrix with the remaining probabilities
i=i+1;

eval(['P' int2str(k) '(i,i-1) = 1;'])
end

p=eval(['P' int2str(k)])*p; % multiply permutation matrix and probability
% vector to get new probability vector.

end, clear p Sum k

%%%%%%%%%%%%%%%%%%%%%% 

% FORM THE SYMBOLS % 

%%%%%%%%%%%%%%%%%%%%%% 

% The symbols are formed using matrices of characters. The characters are 
% ones, zeros, and blanks. Each row in a matrix represents a codeword. The 
% final codewords are in the s0 matrix. Blanks are included in the matrices
% in order to make this part of the algorithm work efficiently. These blanks
% are removed in a later step.

% INITIALIZE ALL CODEWORD MATRICES TO BLANKS (Blank = 32 in ASCII)
for i=1:q-1, eval(['s' int2str(i-1) '= 32*ones(q-i+1,q-i);']), end

% SET RIGHTMOST CODEWORD VECTOR TO ['0' '1']' (0=48 in ASCII, 1=49 in ASCII)
eval(['s' int2str(q-2) ' = [48; 49];'])

% WORK FROM RIGHT TO LEFT USING THE P MATRICES TO FORM THE CODEWORDS
% The codewords are formed from matrices of zeros (ASCII 48), ones (ASCII 49),
% or blanks (ASCII 32). Sq-1 is the rightmost matrix and has the [0 1]'
% matrix. s0 is the leftmost matrix and contains the final codewords
% (except for extra blanks).
for i=q-2:-1:1

twosum=find((sum((eval(['P' int2str(i)])'))')==2);

% twosum is the index of the row of the permutation matrix with two ones.

% This is the row which accomplishes the addition of the two lowest
% probabilities. Its index indicates where the sum is to be placed in the
% new probability vector. This index also gives information on how to
% form the codewords.

onesum=find((sum((eval(['P' int2str(i)])'))')==1);

% onesum has the indices of all the rows of the permutation matrix with
% only single ones. The indices indicate how the probabilities will be
% placed in the new probability vector. These indices also give
% information on how to form the codewords.
eval(['s' int2str(i-1) '(1:q-i-1,1:q-i-1)=s' int2str(i) '(onesum,1:q-i-1);'])
eval(['s' int2str(i-1) '(q-i ,1:q-i-1)=s' int2str(i) '(twosum,1:q-i-1);'])
eval(['s' int2str(i-1) '(q-i+1,1:q-i-1)=s' int2str(i) '(twosum,1:q-i-1);'])
eval(['s' int2str(i-1) '(q-i ,q-i)=48;'])
eval(['s' int2str(i-1) '(q-i+1,q-i)=49;'])

% The five lines above place the appropriate ones, zeros, and blanks in the
% codeword matrices as the progression moves from the right to the left.
eval(['clear P' int2str(i) ' s' int2str(i)])
end, clear onesum twosum

% FIND AND REMOVE THE BLANKS FROM EACH CODEWORD AND COMPUTE WORD LENGTHS

for i=1:q

eval(['S' int2str(i) ' = (s0(i,:));']) % s0 has all the needed information
eval(['c=find(S' int2str(i) '== 32);']) % find all the blanks
eval(['S' int2str(i) '(c) = [];']) % remove all the blanks

eval(['S' int2str(i) ' = setstr(S' int2str(i) ');']) % convert from ASCII
eval(['L(i)=length(S' int2str(i) ');']) % compute the length of each codeword
end, clear s0 c
avwlen = sum(L*pp);

%%%%%%%%%%%%%%%%%%%%%%%% 

% DISPLAY THE OUTPUT % 

%%%%%%%%%%%%%%%%%%%%%%%% 

disp(' ')

disp('Symbol Probability Code Word')

disp('----------------------------')

for i=1:q

is=[int2str(i) blanks(6)]; is=is(1:7);
ps=[num2str(pp(i)) '000000']; ps=ps(1:6);
disp([' s',is,' ',ps,' ',eval(['S' int2str(i)])])

end, clear is ps i q, disp(' ')

% COMPUTE AND DISPLAY AVERAGE WORD LENGTH
L_avg=sum(L * pp);

disp(['The average word length is ', num2str(L_avg)])

% COMPUTE AND DISPLAY THE ENTROPY

H=sum(pp.*log2(1./pp));

disp(['The entropy is ', num2str(H)])

% COMPUTE AND DISPLAY VARIANCE

var=sum(((L_avg-L).^2)*pp);

disp(['The variance is ', num2str(var)]) 
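Since huffcod.m is only used for its average word length, that quantity can be cross-checked without building any codewords. The sketch below is not the permutation-matrix method of the listing; it relies on the standard property that the average length of a Huffman code equals the sum of the probabilities of all merged nodes, and it uses a made-up probability vector.

```matlab
% Sketch: average Huffman word length from successive merges (illustrative only).
pp = [0.4 0.2 0.2 0.1 0.1];      % hypothetical symbol probabilities (sum to 1)

work = pp;                       % working list of node probabilities
avglen = 0;
while length(work) > 1
    work = sort(work);           % ascending: the two smallest come first
    merged = work(1) + work(2);  % combine the two least probable nodes
    avglen = avglen + merged;    % each merge adds one bit to every symbol below it
    work = [merged, work(3:end)];
end

H = -sum(pp .* log2(pp));        % entropy in bits per symbol (requires pp > 0)
disp([avglen H])
```

For these probabilities the average length is 2.2 bits per symbol, which lies between the entropy and the entropy plus one bit, as expected for any Huffman code.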